A VIAF Reconciliation Service for OpenRefine


OpenRefine is a wonderful tool my coworkers have been using to clean data for my project at work. Our workflow has been nice and simple: they take a CSV dump from a database, transform the data in OpenRefine, and export it as CSV. I write scripts to detect the changes and update the database with the new data.

We have a need, in the next few months, to reconcile the names of various individuals and organizations with standard “universal” identifiers for them in the Virtual International Authority File. The tricky part is that any given name in our system might have several candidates in VIAF, so it can’t be a fully automated process. A human being needs to look at them and make a decision. OpenRefine allows you to do this reconciliation, and also provides an interface that lets you choose among candidates.

Communicating with VIAF is not built in, though. Roderic D. M. Page wrote a VIAF reconciliation service, and it’s publicly accessible at the address listed on the linked page (the PHP source code is available here). It works very nicely.

I wanted to write my own version for 2 reasons: 1) I needed it to support the different name types in VIAF, 2) I wanted to host it myself, in case I needed to make large numbers of queries, so as not to be an obnoxious burden on Page’s server.

The project is called refine_viaf and the source code is available at https://github.com/codeforkjeff/refine_viaf.

For those who just want to use it without hosting their own installation, I’ve also made the service publicly accessible at http://refine.codefork.com, where there are instructions on how to configure OpenRefine to use it.

Arriving Late to the Party


I started playing with Ruby this past week, at the suggestion of a coworker I respect who thinks highly of the language, and of Rails. I’ve found Programming Ruby 1.9 & 2.0 to be an excellent way to get started.

Now, it’s only been a week, but so far? I really love it.

This surprised me.

Some first thoughts and impressions, subject to change:

1) Ruby is VERY expressive. I’m astounded by what you can do in a few short lines of code. I wasn’t initially thrilled about some of its uses of character symbols, obviously borrowed from Perl, but these are actually pretty judicious and not as bad as I expected.

2) There’s a commonplace idea that Python and Ruby are redundant: if you know one, there’s not a lot of reason to learn the other. I can see this, as they do have many overlapping features and roles, but their philosophies could not be more different. So from an experiential standpoint rather than a business one, they’re both worth learning.

3) Things that Ruby does better than Python: the object orientation is stronger (access controls, single inheritance with a powerful mixin mechanism, most[?] operators are methods); blocks are way more powerful than lambdas; regexes are native constructs; Ruby’s symbols are a nice feature taken from Lisp.

4) I hate that ‘require’ autoimports things into your namespace, like Perl does. This is where I much prefer Python’s ‘import’ statement. For the love of God, DO NOT TOUCH MY NAMESPACE unless I say so. (EDIT: this is actually incorrect! Files that get required typically define modules, which are constants in a global namespace, which is quite different from autoimporting names into the current namespace, like Perl.)

5) The fact that strings are mutable is one of those things that filled me with horror, but I’m not sure it actually matters that much in practice.

6) If Why’s Poignant Guide to Ruby doesn’t inspire you to learn and use Ruby, you may not be human.

As a learning exercise, I rewrote a short Python script that watches RSS feeds for certain keywords. It’s available on github. Even as a newcomer (albeit one with Perl and Python experience), writing the code was intuitive and painless. In fact, it was a strangely pleasing experience. The code could be improved, I’m sure, but arriving at a decent, clean first pass wasn’t hard to do at all.

This feels like a good time to look at Ruby. The initial frenzy and excitement over Ruby, and Rails, in the mid-to-late 2000s has died down considerably, displaced by the current wave of languages focusing on concurrency. Yet its performance, once a huge blight, has been quietly improving from 1.8 to 1.9 to 2.0, and the ecosystem (gems, rbenv, rake) seems very mature.

Now that it’s no longer sexy, it can focus on doing what it was designed for in the first place: making programmers happier to be writing code. Languages these days are emphasizing features and power, but how many actually tout your happiness as its primary reason for existence?