Thursday, August 6, 2009

Dealing with humans' (names)

Recently I hit a slight snag on a fairly common problem... dealing with names. This is a problematic area: everyone has one, yet building what we know about names into software is actually a bit of a slog!

What I'm doing is trying to parse names (mainly author names) for txtckr, so that one of the output display formats can be a reference (APA, for example). To do this, I also need to untangle the "rft.au" information which is delivered through OpenURL, and I'm trying to build in some "forgiveness" to allow for people/companies that don't follow the specs properly!
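
As a rough illustration of the "forgiveness" idea, here is a minimal Python sketch (not necessarily the language txtckr uses) that pulls the author keys out of an OpenURL query string. The function name `author_fields` is made up for this example; `rft.au`, `rft.aulast` and `rft.aufirst` are the author keys I'd expect in a key/encoded-value OpenURL, but a sloppy source may send only some of them, so each one is treated as optional.

```python
from urllib.parse import parse_qs


def author_fields(query):
    """Collect author-related keys from an OpenURL query string.

    Hypothetical helper: every key is optional, so a missing one
    comes back as None (or an empty list for rft.au).
    """
    params = parse_qs(query)  # decodes %XX and '+', gives a list per key
    return {
        "au": params.get("rft.au", []),  # full name(s); the key may repeat
        "aulast": params.get("rft.aulast", [None])[0],
        "aufirst": params.get("rft.aufirst", [None])[0],
    }


fields = author_fields("rft.au=Pasley%2C+Tom&rft.aulast=Pasley")
```

If `aulast` and `aufirst` are both missing, the only option left is to pick the full `au` string apart.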

Things to consider:
  • with a full name, is it supplied as first-name(s) followed by last-name/surname, and if so, where does the surname begin? That's easy enough for a fair number of relatively simple names, but what about multi-word surnames, such as "van der Weerden"?
  • if you're going to receive name fragments, how do you build these sensibly into software, so you can give permutations of the name, e.g. Pasley, Tom == Pasley, T. == Tom Pasley == T. Pasley?
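
Both points can be sketched together in Python. This is only one possible approach under loud assumptions: the `PARTICLES` set is a hypothetical and very incomplete list of lowercase surname prefixes, the function names are invented for the example, and names that don't fit the "given names, then surname" shape (or whose particles aren't listed) will be split wrongly.

```python
# Hypothetical, incomplete list of surname particles (an assumption, not a standard).
PARTICLES = {"van", "von", "der", "den", "de", "la", "le", "di", "da"}


def split_name(raw):
    """Return (given, surname) from 'Surname, Given' or 'Given Surname'."""
    raw = " ".join(raw.split())  # collapse stray whitespace
    if "," in raw:
        surname, given = (part.strip() for part in raw.split(",", 1))
        return given, surname
    words = raw.split()
    # Start with the last word as the surname, then absorb preceding particles.
    start = len(words) - 1
    while start > 0 and words[start - 1].lower() in PARTICLES:
        start -= 1
    return " ".join(words[:start]), " ".join(words[start:])


def initials(given):
    """'Tom' -> 'T.', 'Mary Ann' -> 'M. A.'"""
    return " ".join(word[0].upper() + "." for word in given.split())


def name_variants(raw):
    """List the common permutations of a name for matching or display."""
    given, surname = split_name(raw)
    if not given:
        return [surname]
    init = initials(given)
    return [
        f"{surname}, {given}",
        f"{surname}, {init}",
        f"{given} {surname}",
        f"{init} {surname}",
    ]


print(split_name("Jan van der Weerden"))
print(name_variants("Pasley, Tom"))
```

Comparing the variant lists (as sets) of two inputs gives a crude way to decide that "Tom Pasley" and "Pasley, Tom" are the same person, though "T. Pasley" on its own only overlaps partially with them.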
No doubt I'm not the first person to tackle this problem, and I'm probably over-thinking things slightly, but I'm open to tips about projects that handle this, or from anyone else who's tackled it...