A VIAF Reconciliation Service for OpenRefine

open-refine

OpenRefine is a wonderful tool my coworkers have been using to clean data for my project at work. Our workflow has been nice and simple: they take a CSV dump from a database, transform the data in OpenRefine, and export it as CSV. I write scripts to detect the changes and update the database with the new data.

We have a need, in the next few months, to reconcile the names of various individuals and organizations with standard “universal” identifiers for them in the Virtual International Authority File. The tricky part is that any given name in our system might have several candidates in VIAF, so it can’t be a fully automated process. A human being needs to look at them and make a decision. OpenRefine allows you to do this reconciliation, and also provides an interface that lets you choose among candidates.

Communicating with VIAF is not built in, though. Roderic D. M. Page wrote a VIAF reconciliation service, and it’s publicly accessible at the address listed on the linked page (the PHP source code is available here). It works very nicely.

I wanted to write my own version for 2 reasons: 1) I needed it to support the different name types in VIAF, 2) I wanted to host it myself, in case I needed to make large numbers of queries, so as not to be an obnoxious burden on Page’s server.

The project is called refine_viaf and the source code is available at https://github.com/codeforkjeff/refine_viaf.

For those who just want to use it without hosting their own installation, I’ve also made the service publicly accessible at http://refine.codefork.com, where there are instructions on how to configure OpenRefine to use it.

Goodbye, Sublime Text?

When one of my coworkers started using Sublime Text about a year ago, I was intrigued. I played with it and found it to be a very featureful and speedy editor. I wasn’t compelled enough to make the switch from Emacs, though. (You’ll pry it from my cold dead hands!) But I really liked the fact that you could write plugins for it in Python.

So for fun, I gradually ported my emacs library, which integrates with a bunch of custom development tools at work, to Sublime Text. It works very well, and the ST users in the office have been happy with it. Although I don’t actually use ST regularly, I’ve since been following news about its development.

What I discovered is that many of its users are unhappy with the price tag and dissatisfied with the support they received via the forums. So much so, in fact, that there’s now an attempt to create an open source clone by reverse engineering it. The project is named lime.

I learned about this with very mixed feelings. There’s a good chance the project will take off, given how much frustration exists with ST. Of course, the trend is nothing new: open source software has been supplanting closed source commercial software for a long time now. But this isn’t Microsoft or Oracle we’re talking about; it’s a very small company, charging what I think is a reasonable amount of money for their product. While they undoubtedly could do more to make their users happier, I imagine that they probably can’t do so without hurting what I imagine are pretty slim profit margins. That, or not sleeping ever again.

It’s not news that making a software product is much less viable than it used to be. Where money is made, it’s increasingly through consulting and customization, but one wonders about the size of that market.

It’s generally a good thing that open source has “socialized” software development: technology has enabled communities of programmers to contribute and collaborate on a large scale, in a highly distributed fashion, to create good quality software available to all, taking it out of the profit equation. The problem is that the rest of the economy hasn’t caught up with this new kind of economics.

I don’t mean to sound dramatic: there are many jobs out there for programmers, of course. But it saddens me that if you want to try to create a product to sell, it’s simply not enough to have a good idea anymore, in this day and age. It has to be dirt cheap or free, you have to respond to every message immediately, and respect every single feature request. Between the open source world and the big software companies that service corporate customers, there is a vast middle ground of small companies that is quickly vanishing.

Looking at Go

In the latest stage of my exploration/deepening of programming knowledge, I’ve been looking at Go.

There’s got to be something that piques my intellectual curiosity or solves a specific problem for me to want to learn a new language. Not much about the latest “hot” languages like Ruby, Scala, and Erlang appeals to me, so I haven’t bothered with them. In real world work, I like Python as a general purpose language, and I like Java (seriously!) for large projects that need the strong tooling and frameworks available for it. Lisp and Clojure have provided useful perspective and food for thought, but in practice, they haven’t found a place in the real world software I write. Everything else I tolerate only because I have to (I’m looking at you, Javascript).

Go is extremely intriguing. It strikes me as combining some of the best things about Python and Java. It would be great not to have to choose! I like the simple syntax (not as simple as Python, alas!), the static typing, the fact that it’s compiled, and the general philosophy of favoring composition over inheritance, an idea I’ve come to support more and more. In a world currently dominated by highly dynamic, interpreted languages with very loose typing systems and a hierarchical object oriented paradigm, Go is incredibly unique! Follow the trend of languages like Clojure, Go has concurrency features that take strong advantage of multicore computing, except that its concurrency mechanisms seem much simpler. I’ve started to look at code samples and play with it a bit, and I really like what I see so far.

There’s actually a lot of negative discussions of Go on the web, but most of them are about the language in its messy pre-1.0 state. The March 1.0 release has supposedly tightened up a lot of things, and of course, performance will only get better, now that the fundamental semantics and features are solidly in place. This is an exciting time for what feels like the next evolutionary step in programming languages.

Monitoring DSL diagnostics on a ZyXEL P-600 Series modem

The DSL at the house has been really flakey the past 3 days. The line seems to periodically drop and I also noticed that the voice line had a lot of static.

I suspected a line problem so I called Earthlink support last night to try to get it straightened out. Surprisingly, the line tested good up until the point where it came into the house. But the diagnostics on the DSL modem were showing low noise margin (signal-to-noise ratio) and high attenuation (degradation), so there was definitely a problem somewhere. The trouble had to be inside the house: either the DSL modem or some wiring somewhere had gone bad or both.

I tried changing out some cables and tweaking the way the phone and fax (my housemate runs her own business) were all hooked up into the line. That seems to have fixed the problem. The line’s been steady for almost 24 hours now. I’ve been watching the diagnostics and researching what the numbers mean, and they seem to be within acceptable-to-good ranges. So far so good.

It was annoying to have to load the web interface on the ZyXEL P-600 Series modem (it’s a P-660R-ELNK) and continually refresh the page to watch the diagnostics. Plus I had to remember if any of the numbers changed and by how much. So I whipped up a little python script to fetch the data from the web interface and do some logging.

There were some interesting peculiarities in the ZyXEL web interface. After authenticating with a password, only that computer’s IP can use the interface, apparently; requests from other hosts get locked out until a few minutes of inactivity. Also, the way the interface refreshed the diagnostics page was a bit odd: it used a form submission to set some state that would cause another page to update its contents.

The script output looks like this:

Sun Jan 17 19:15:37 2010 noise = 16 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:16:37 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:17:38 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:18:38 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:19:38 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:20:39 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:21:39 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:22:39 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:23:39 2010 noise = 18 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:24:39 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)

A copy of the script is available here. You may need to do some tweaking to get it to work with your setup. It works with python 2.5 and 2.6.

FixedGearGallery Index 2.0

I created a new interface for my FixedGearGallery Index. What better way to procrastinate than spending a few hours on code?

The original purpose of the index was to provide an easy way to browse through the relevant pages of a particular make/model on FGG. My first version accomplished that goal, but it’s awfully clunky. After using it a while, I discovered how annoying it was to toggle between windows and keep track of where I was in the list.

The new version places navigation controls in a small area at the top of the page. It loads content from FGG into an iframe, eliminating the need for switching among windows. And the previous/next links allow you to browse sequentially, making it much easier to keep track of what you’ve already seen.

It’s not perfect but it’s definitely an improvement. I have fancier ideas for organizing FGG content but I don’t want to go too far by pirating Dennis’ site. I’m grateful he gave me permission to do the index at all when I emailed him about it a few months ago.