Category Archives: software

Announcing conciliator

I’ve just created a github repository for conciliator, a growing collection of OpenRefine reconciliation services, as well as a Java framework for creating them.

conciliator is a major refactoring of my refine_viaf project and supersedes it. This new project cleanly separates the VIAF-specific parts from the more “boilerplate” pieces needed for any OpenRefine reconciliation service. The result is a framework that allows you to easily write new reconciliation services. My intent here is to make some existing code way more flexible, so that it might be useful to more users and have a longer lifespan. The publicly hosted service has already been running conciliator for a week now; if you’ve been using it, you don’t need to make any changes in OpenRefine.

Currently, conciliator can query VIAF out of the box, exactly like refine_viaf does, down to the same URLs. Additionally, conciliator can now query ORCID names. This was a somewhat arbitrary choice; I’ve been doing some ORCID integration at work, so it was convenient for me to implement a data source for it as a proof of concept.

With VIAF and ORCID, conciliator acts as an intermediate or “bridge” service, but it would be possible to use conciliator to query other types of data sources as well: files, SQL databases, etc. Right now, you’d have to write your own code to read and parse files, open database connections, etc. But in the future, I hope to add support for these options to make them easier to implement.

For details on how to write your own service in Java using conciliator, see the README.
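
To give a flavor of what a custom data source can look like, here’s a rough, hypothetical sketch. The class and method names below are purely illustrative and are not conciliator’s actual interfaces; the README has the real details.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch of a data source for a reconciliation service.
    // Class and method names here are illustrative, not conciliator's actual API.
    public class StaticListDataSource {

        // A minimal stand-in for whatever result type the framework expects.
        public static class Result {
            public final String name;
            public final double score;
            public Result(String name, double score) {
                this.name = name;
                this.score = score;
            }
        }

        private static final List<String> NAMES = Arrays.asList(
                "Twain, Mark, 1835-1910",
                "Austen, Jane, 1775-1817");

        // Given a query string from OpenRefine, return candidate matches.
        public List<Result> search(String query, int limit) {
            List<Result> results = new ArrayList<>();
            for (String name : NAMES) {
                if (name.toLowerCase().contains(query.toLowerCase())) {
                    results.add(new Result(name, 1.0));
                }
                if (results.size() >= limit) {
                    break;
                }
            }
            return results;
        }
    }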

Are there data sources you’d like to see available as a reconciliation service? Leave a comment to this post. No promises, but I’ll at least consider all requests. And if you write your own service for a data source, please consider submitting your code as a pull request so that others can use it too!

The Myth of Artisanal Programming

Paul Chiusano, the author of the excellent Functional Programming in Scala from Manning (one of the few tech publishers I buy from; worth every penny), recently wrote a blog post titled, “The advantages of static typing, simply stated”.

Lately all I seem to do is rant to people about this exact topic. Paul’s post is way more succinct than anything I can write, so go over there and read it.

While he takes pains to give a balanced treatment of static vs dynamic type systems, it seems much more cut and dried to me. Dynamic languages are easier and faster for development when you’re getting started on a project, and that’s great if the project never gets very big. But they scale very poorly, for all the reasons he describes. Recently, I had the daunting task of reading through roughly 10k lines of Perl code (pretty good Perl, in my opinion). It was hard to make sense of and figure out how to modify and extend, whereas the MUCH larger Java codebase (over 100k lines, if I recall) that I worked with years ago felt very manageable.

My own history as a programmer matches Paul’s very closely. I started with Java, which was annoying but not a bad language by any means. Then Python came along and seemed like a liberation from Java’s rigidity and verbosity. But Python, Ruby and others are showing their weaknesses, and it’s no mystery why people are turning to the newer generation of statically typed languages like Scala, Haskell, Go, etc.

People who haven’t been around as long don’t necessarily have this perspective.

In retrospect, it’s interesting to me how we programmers “got sold” on dynamic languages, from a cultural perspective. You might recall that a big selling point was using simple text editors rather than IDEs, and there was this sense that writing code this way made you closer to the software somehow. Java was corporate, while Python was hand-crafted. There was a vague implicit notion of “artisanal” programming in these circles.

The upshot, of course, is that every time you read a chunk of code or call a function or method, your brain has to do a lot of the work that a statically typed language would be able to enforce and verify for you. But in a dynamic language, you won’t know what happens until the code runs. In large measure, the quality of software hinges on how much you can tell, a priori, about code before it runs at all. In a dynamic world, anything can happen, and often does.
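
A toy example of what I mean, in Java (this is just an illustration I made up for this post): the compiler rejects the bad call before the code ever runs, which is exactly the work a dynamic language pushes onto your brain and your test suite.

    public class Totals {
        // The signature tells you (and the compiler) exactly what goes in and what comes out.
        static int total(int[] amounts) {
            int sum = 0;
            for (int amount : amounts) {
                sum += amount;
            }
            return sum;
        }

        public static void main(String[] args) {
            System.out.println(total(new int[] {10, 20, 30}));  // 60
            // System.out.println(total("10,20,30"));  // rejected at compile time: wrong argument type
        }
    }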

This is a nightmare, pure and simple. Much of the strong focus on writing automated tests is to basically make up for the lack of static typing.

True artisanship lies in design: namely, thinking hard about the data structures and code organization you’re committing to. It’s not about being able to take liberties that can result in things that make no sense to the machine and that can cause errors at runtime that could have been caught beforehand.

The Linux Desktop bleeding edge

I’m having some trouble running Firefox 46, which was released in late April. I’ve had to roll back to 45.0.2 for now. No big deal, but my woes are pretty indicative of the complexities of the Linux desktop, so I thought it’d be interesting to write a little about it.

I run the “testing” distribution of Debian. Its stability lies somewhere between the “unstable” (things are largely untested) and “stable” (release quality) dists, so it’s pretty good, although the occasional hiccup is to be expected. My desktop environment is Xfce.

Firefox 46 contained a big change: the official binary releases are compiled using gtk3 instead of gtk2. I like using the official releases because the debian firefox package sometimes takes a little while to catch up to the latest version. For those who are unfamiliar, gtk is the graphics library for rendering the user interface, including all the widgets and their look-and-feel.

gtk3, in turn, has its own versions. In late March, gtk3 in Debian testing was updated to 3.20, which apparently contained some major changes from 3.19.

The problem is that Firefox 46 seems to work fine with pre-3.20 versions of gtk3; with 3.20, however, scrollbars, radio buttons and other widgets are rendered incorrectly or are even missing entirely. One of the bug reports can be found here.

You can apparently work around this issue if you use certain gtk3 themes. Not being a theme guru, I’m not sure exactly why; I was only able to determine this by experimenting with different themes and seeing how the Firefox rendering changed.

Okay, fine: I’d been using the default Xfce theme, which I quite like, but I’m willing to change it to make Firefox work. Still, I encountered two problems with this workaround:

1) 3.20 is so new that many gtk3 themes included in Debian testing are broken because they haven’t been updated yet to be compatible. While I could get the scrollbars and radio buttons to work with some of these themes, there were often spacing issues around certain widgets, making UIs unusable or extremely annoying.

2) I need to find a theme that supports BOTH gtk2 and gtk3, since Xfce uses gtk2; otherwise I’ll end up with an inconsistent look-and-feel across applications. Not all themes support both.

People complain about the state of the Linux desktop all the time, but the fact is, there are many moving parts that comprise a desktop environment. It’s a complex web of dependencies. Sometimes this means certain software packages have to be locked in to previous versions. Sometimes the newest version of a library can’t go into a distribution because it would break too many things that use it. Being able to run the latest and greatest versions of everything is a LOT harder than one might imagine.

In this case, I’m sure there will be a fix in Firefox and/or updates to the gtk3 themes soon enough.

Algorithms I: Notes in Week 5

Scattered thoughts:

A course on learning a programming language will help answer the question, “how do I do X?” The fun thing about an algorithms course is that the question is “how do I do X within certain parameters of time and space?”

In the real world, the two questions are actually one and the same. I’ve just come away from a project that had serious scalability problems, because many of its features could handle only the very small sets of data used in development; when the app was run against live data, things stopped working: requests would hit a timeout limit or processes would run out of memory.

I’m learning quickly that I can often intuit the “shape” of how an algorithm will perform, and I now have better language for describing this, but I’m not so good at calculating precisely the order of growth for even slightly complex code. It’s hard!
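
A made-up example of the kind of pattern I mean: the quadratic “shape” of a doubly nested loop is easy to intuit, even when the precise analysis of messier code is not.

    public class Growth {
        // Counts pairs (i, j), with i < j, whose values sum to zero.
        // The doubly nested loop touches ~N^2/2 pairs: quadratic growth.
        static int countZeroSumPairs(int[] a) {
            int count = 0;
            for (int i = 0; i < a.length; i++) {
                for (int j = i + 1; j < a.length; j++) {
                    if (a[i] + a[j] == 0) {
                        count++;
                    }
                }
            }
            return count;
        }

        public static void main(String[] args) {
            System.out.println(countZeroSumPairs(new int[] {-3, 1, 3, -1, 2}));  // 2
        }
    }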

One paranoia-inducing aspect of programming assignments: for week 4’s assignment, a single timing test (1 out of 17) failed for my code because it took too long to finish. It’s hard to figure out… does this single failure expose a flaw in my overall implementation (if so, why did the other 16 pass)? Or was this last test thrown in as a “bonus” involving a difficult set of inputs that would require further optimization if you wanted to get full points? This is a tricky thing to assess as a student, and something only a human being would be able to tell you.

Trees are truly magical. I feel like I’ve barely started to grasp their many applications.

Algorithms I on Coursera

I’m currently taking the “Algorithms I” course on Coursera, a session of which started on January 22nd. I thought I’d write up my impressions so far on taking my first MOOC.

As someone who taught at a university for seven years in the humanities, I should say right off the bat that I dislike the idea of online learning for the reasons you might expect. But this course appealed to me for a few reasons. It’s developed and taught by Robert Sedgewick and Kevin Wayne, the authors of the highly regarded Algorithms, 4th Edition book. The syllabi of the two-course sequence on Coursera would make for the type of semester-length course you’d find in a respectable Computer Science department. Finally, Coursera has a reputation for offering more rigorous and demanding courses than other similar MOOC sites.

So far, I’m keeping up with the schedule and am in the middle of the Week 2 material. I’ve found it to be a positive experience so far, and more challenging than I’d expected!

Some initial impressions:

  • The course is a serious time commitment. Per week, it’s 2 hours of lecture + 2 hours for exercises + 4-12 hours for the programming assignment. I’ve chosen to skip the “interview questions” supplementary material.
  • Assignment grading is, thus far, very rigorous. Submitted source code is analyzed and run through a battery of tests measuring not only correctness, but code cleanliness, run times, and memory use, and scored accordingly.
  • The ability to submit exercises and assignments as many times as you like in order to improve your score is a fantastic feature. (I don’t know if all Coursera courses work this way.) It means you can really learn from your mistakes by correcting them; it also gives you the chance to try out alternative solutions. This is WAY better than the traditional one-shot-only model of graded assignments, which is terrible for actual learning.
  • Basing the course on a published textbook (which is optional) is extremely helpful. There’s material covered more deeply in the text than in the lectures, but the lectures also address some aspects of topics and problems not covered in the book. This makes for a strong complementary relationship between the two; it doesn’t feel like the lectures are simply repeating the textbook.
  • You’re firmly expected to have some basic programming skills and a bit of math as a prerequisite. I like that the lectures keep the focus on the topics at hand, and don’t try to make the course all things to all people. If students need to “catch up” because they’re new to Java or their math is rusty, they use the discussion forums to do so.

As for the actual material, I’ve already learned a lot so far:

  • I’ve gotten some exposure to formal methods for algorithm analysis. A week and a half obviously isn’t going to make anyone great at this, but at least I now have some approaches for thinking through correctness, run times, and memory use mathematically, whereas before, I would mostly work empirically.
  • I can better identify different orders of growth and some of the common code patterns that indicate them.
  • The first week’s case study of different algorithms for Union-Find was, for me, a thought-provoking exercise in what is possible with arrays vs trees in representing relationships among data. The programming assignment is so stringent that it’s difficult to satisfy all the run time and memory requirements for a perfect score. This has generated a lot of insightful discussion in the forums about optimization.
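
To make that array-vs-tree contrast concrete, here’s a minimal version of the weighted quick-union idea from the lectures: a plain int array of parent links that implicitly encodes a forest of trees. (This is my own simplified sketch, without the further refinements the assignment pushes you toward.)

    public class WeightedQuickUnion {
        private final int[] parent;  // parent[i] = parent of i; a root points to itself
        private final int[] size;    // size[i] = number of nodes in the tree rooted at i

        public WeightedQuickUnion(int n) {
            parent = new int[n];
            size = new int[n];
            for (int i = 0; i < n; i++) {
                parent[i] = i;
                size[i] = 1;
            }
        }

        // Follow parent links until reaching a root.
        private int root(int p) {
            while (p != parent[p]) {
                p = parent[p];
            }
            return p;
        }

        public boolean connected(int p, int q) {
            return root(p) == root(q);
        }

        // Link the root of the smaller tree to the root of the larger one,
        // which keeps the trees shallow.
        public void union(int p, int q) {
            int rootP = root(p);
            int rootQ = root(q);
            if (rootP == rootQ) {
                return;
            }
            if (size[rootP] < size[rootQ]) {
                parent[rootP] = rootQ;
                size[rootQ] += size[rootP];
            } else {
                parent[rootQ] = rootP;
                size[rootP] += size[rootQ];
            }
        }
    }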

Algorithms really get at the essence of what programming is. Anyone who works as a programmer has to put into practice algorithmic thinking to some degree, even if they aren’t aware of it.

I plan to continue writing about this as a way to keep me accountable for completing the two-course sequence.

A Major Update to refine_viaf

I’ve rewritten my refine_viaf project in Java. It’s what the public service is now running. The old python code is considered deprecated and will no longer be maintained, but will remain available in the python-deprecated branch on github.

The only thing most users need to know is that refine_viaf should return better results now. For the curious, this post explains the subtle but important differences in the new version and some reasons for the rewrite.


In a nutshell, the main difference/improvement is that searches now behave more like the VIAF website.

This is due mainly to how sources (e.g. “LC” for Library of Congress) are handled. Previously, either the source specified in the URL or the “preferred source” from the config file was used to filter search results, but it did NOT get passed into the actual VIAF search query. This could give you some weird results. The new version works like VIAF’s website: if you don’t specify a source, everything gets searched; if you do specify one, it DOES get passed to the VIAF search query. Simple.

The old version had weird rules for which name in each VIAF “cluster” result it actually displayed. In the new version, if you don’t specify a source, the most popular name (i.e. the name used by the most sources) in a search result is used for display. If you specify a source, then its name is always used.
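
Here’s a rough sketch of that selection rule, purely as an illustration; this is not the actual refine_viaf code.

    import java.util.List;
    import java.util.Map;

    // Illustration only, not the actual refine_viaf code: given the names in a
    // VIAF "cluster" and the sources that use each name, pick the name to display.
    public class DisplayName {

        public static String pick(Map<String, List<String>> sourcesByName, String preferredSource) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, List<String>> entry : sourcesByName.entrySet()) {
                String name = entry.getKey();
                List<String> sources = entry.getValue();
                if (preferredSource != null) {
                    // A source was specified: always use the name that source uses.
                    if (sources.contains(preferredSource)) {
                        return name;
                    }
                } else if (sources.size() > bestCount) {
                    // No source specified: use the most popular name.
                    best = name;
                    bestCount = sources.size();
                }
            }
            return best;
        }
    }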

The old version supported a comma-separated list of sources at the end of the URL path. In the new version, only a single source is supported, since that’s what VIAF’s API accepts.

Lastly, the licenses are different: the python version was distributed under a BSD license. The new version is GNU GPL.

Other reasons for the rewrite

The changes above could have been implemented in python. I decided to rewrite it in Java for a few reasons:

– Overall performance is better in Java. The Django app used BeautifulSoup because VIAF’s XML used to be janky, but it appears this is no longer the case; Java’s SAX parser works great with their XML these days and is very fast (a small sketch of the SAX approach follows this list). BeautifulSoup would leak memory and consume a lot of CPU, to the point where it would trigger automated warnings from my VPS provider. My server is very modest and needs to run other things, so these were real problems. Running the service as a single multi-threaded Java process keeps memory usage low and predictable, and it never spikes the CPU.

– Running a Java jar file is MUCH easier for people who want to run their own service, especially on Windows. With the python version, you had to install pip, install a bunch of packages, and create and configure a Django app, all of which put the software out of reach of many users who might want to run it.

– I don’t care what other people think: I like Java. Plus I wanted to experiment with Spring Boot. There are much leaner web frameworks I could have used to save some memory, but it was interesting to play with Spring.
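
To give an idea of the SAX approach mentioned above, here’s a stripped-down handler. It’s only an illustration: the element name “text” is a placeholder rather than necessarily what VIAF returns, and the real parsing code in the project is more involved.

    import java.io.StringReader;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    // A simplified illustration of event-driven SAX parsing; the element name
    // "text" is a placeholder, not necessarily what VIAF's XML actually uses.
    public class NameHandler extends DefaultHandler {
        private final StringBuilder current = new StringBuilder();
        private boolean inText = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equals("text")) {
                inText = true;
                current.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("text")) {
                inText = false;
                System.out.println("Found name: " + current);
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            String xml = "<cluster><text>Twain, Mark, 1835-1910</text></cluster>";
            parser.parse(new InputSource(new StringReader(xml)), new NameHandler());
        }
    }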

Leave a comment!

If you use this thing, please take a second and leave a comment on this post. I’m interested to know how many people really run this on their own computers.


A Year of Stats for

A year ago, I wrote an OpenRefine reconciliation service that queries VIAF and posted the source code on github. I’ve also been hosting it publicly for anyone to use.

Below are two charts showing a year’s worth of usage statistics for this service. The first chart counts web requests made. (The way OpenRefine works, a single web request contains up to 10 name reconciliation queries, so 200 web requests can translate to as many as 2000 name reconciliations.) The second chart counts the unique hosts (i.e. unique computers, more or less) that used the service.

Last month, the busiest one yet, 47 different computers made an average of approximately 2360 name reconciliations each!

This usage has certainly exceeded my expectations. Now and then, I get a very nice tweet or two from someone who’s used it, which is really gratifying, considering it was just a little side project I threw together. You just never know.

Taking A Fresh Look at PHP

I’ve recently started working on a PHP/Laravel project.

PHP isn’t new to me. Many years ago, I wrote a very simple online catalog and shopping cart, from scratch, for a friend who had his own business as a rare book dealer. He used it with much success for several years. I’d also done a bit of hacking on some Drupal plugins.

Coming back to PHP now, I’m finding myself in a world MUCH different than the one I’d left.

First off, let’s admit that PHP comes with a lot of baggage. For a long time, “real” programmers shunned PHP because it was born as a language cobbled together to do simple web development but not much more. Its ease of use, combined with the fact that it was easy to deploy on commodity web hosting, meant you could find PHP talent for relatively cheap to build your applications. The stereotype was that PHP developers relied on a lot of patchy copy-and-paste solutions to build shoddy and insecure websites.

A LOT has happened since then. Here’s what I’ve encountered so far, diving back into PHP:

Object-orientation: PHP has had objects for a long time, but more recent features like namespaces, traits, and class autoloading have made newer PHP projects very strongly object-oriented. You can even find books on design patterns for PHP.

To me, this is the single most important positive change to the PHP world. The culture has changed from an ad hoc procedural mindset to more sophisticated thinking about coding for large-scale architectures.

Frameworks: Several major MVC frameworks exist, many of them drawing inspiration from Rails.

Performance: As of 5.5, PHP has a built-in opcode cache, making it much more performant. An alternative to core PHP is the HHVM project, backed by Facebook, which is a high-performance PHP implementation. HHVM has had a “rising tide” effect: the forthcoming PHP7 is supposed to be as fast as HHVM. So whatever you use, you can expect good performance at scale.

Tooling: There is sophisticated tooling like composer and a vibrant ecosystem of packages. While you can still deploy PHP applications the old way, using Apache and mod_php, there is a mature FastCGI Process Manager (PHP-FPM) engine that isolates PHP processes from the web server. PHP-FPM allows Apache/nginx/whatever web server to handle static content while a pool of processes handles PHP requests. This results in much more efficient memory usage and better performance.

Success: Many respectable, high-profile products have been built using PHP: WordPress, Drupal, and Facebook, just to name a few.

But all this is just to state a bunch of known facts. To me, the biggest surprise has been the EXPERIENCE of beginning to write code again in PHP and using Laravel: what does that FEEL like?

In a word, it feels like Java, minus the strong typing. This is an entirely good thing in my opinion, despite criticisms that PHP technologies have become too complex and overdesigned.

The biggest paradigm difference between PHP and other popular web application back-ends is that nothing remains loaded in memory between requests. It’s true that opcode caching means PHP doesn’t have to re-compile PHP source code files to opcodes every time, which speeds things up greatly, but the opcodes still need to be executed for each request, from bootstrapping the application to finishing the HTTP response. In practice, this doesn’t actually matter, but it’s such a significant under-the-hood difference from, say, Django and Rails, that I find myself thinking about it from time to time.

It’s reassuring that when I scour the interwebs researching something PHP-related, I’m finding a lot of informed technical discussions and smart people who have come to PHP from other languages and backgrounds. It bodes well for the strengths and the future of the technology.

On Magic

Kids, I hate to break it to you, but there is no such thing as magic.

The cool whizzy stuff on your screen that impresses you: that’s the result of work. The button that was broken yesterday, that now works correctly today: also the product of work. The screen that was discussed in a meeting last week that suddenly appeared today on the development server: yup, work. When you look for a feature in the web application and it isn’t there, there’s this thing that can create it and put it there: it’s called work.

Someday we’ll all get over the mystifying aura of technology. Someday people will learn to recognize that programmers are not magicians, just workers, and that the work they do involves mundane, non-magical tasks, like wrestling with code libraries and frameworks to get them to do what we want, reorganizing files to make sure stuff exists in sensible places, and figuring out what to do when changing one piece affects three other pieces in unexpected ways.

And this means, someday, people will understand that, like any other kind of work, software development takes resources (namely, time!), not a magic wand. And no amount of “ambition” (read: wishful thinking) can really change that basic equation. You can pretend magic exists, but that doesn’t make it so. You aren’t fooling anyone. You just look childish.

When software development is recognized as work, there can be clarity about what is possible with a given set of resources. Then tasks can be sanely identified, specified, prioritized, coordinated, scheduled, executed, completed.

And then some really cool things can happen. Not magical things, but really cool things. Great things, even. The kind of great things that result from understanding, dedication, and hard work.

Adventures in Docker


After you’ve taken the time to puzzle through what it is exactly, Docker is nothing short of life changing.

In a nutshell, Docker lets you run programs in separate “containers”, isolating the dependencies each one requires. This is similar to virtualization solutions like VMware and VirtualBox, but Docker is a much more fine-grained, customizable tool that operates at a different level.

It took me a week of experimentation to develop a firm grasp of the Docker concepts and how all its pieces work together in practice. I’m now using it at work for development, and I hope to be setting up a configuration for staging soon.

This is a short write-up of what I’ve learned so far.

The Concepts

At first glance, almost everyone (including me) mistakes Docker for another virtualization technology. It’s not. Virtualization lets you run a machine within a machine. A Docker container is more subtle: it segments or isolates part of your existing Linux operating system.

A container consists of a separate memory space and filesystem (taken from something called an image). A container actually uses the same kernel as your “host” system. Some fancy Linux kernel features (namespaces and cgroups) make all of this possible; there is no hardware virtualization going on.

You start using Docker by creating an image or using an existing one made by someone else. An image is a filesystem snapshot. You can build images in an automated fashion using a Dockerfile, which allows you to “script” the running of commands to do things like install software and copy files around.
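
Here’s a trivial, made-up Dockerfile just to show the flavor; the base image, packages, and paths are arbitrary examples.

    # Start from an existing base image
    FROM ubuntu:14.04

    # Install software into the image
    RUN apt-get update && apt-get install -y ruby

    # Copy application files into the image's filesystem
    COPY . /app
    WORKDIR /app

    # The default process to run when a container is started from this image
    CMD ["ruby", "app.rb"]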

When you launch a container, Docker makes a “copy” of an image (well, not really, but we’ll pretend for now) and uses that to start a process. The fact that the filesystem is a “copy” is important: if you launch 2 containers using the same image, and the processes in each modify files, those changes happen in each CONTAINER’s filesystem; the image doesn’t change. Processes inside containers only see what’s in the container, so they are isolated from one another. This allows complete dependency separation, at the level of processes.

You can do a lot with containers. You can run multiple processes in them (though this is discouraged). You can start another process in an already running container, so it can interact with the already running process. After a container has stopped (by halting the process running in it), you can start it back up again. Again, any file changes made are in the container’s filesystem; the image remains unchanged.


There are three issues I’ve personally encountered so far in using Docker:

1) Persistent Storage

Containers are meant to be transient. In a well designed setup, you should, theoretically, be able to spin up new containers and discard old ones all the time, and not lose any data. This means that your persistent storage has to live somewhere else, in one of two places: a special “data container” or a directory mounted from the host filesystem.

Data containers were too complicated and weird, and I couldn’t get them to work the way I expected, so I mounted directories instead. This has the nice side effect that, as Dockerized processes change files, you can see those changes immediately from the host without having to do anything special to access them. I’m not sure, however, what “best practices” are concerning storage.
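
For what it’s worth, mounting a host directory is just a flag on docker run; the paths, image, and password in this example are made up.

    # Mount the host's /srv/myapp/mysql-data at /var/lib/mysql inside the container,
    # so the database files survive the container being removed.
    docker run -d \
      -v /srv/myapp/mysql-data:/var/lib/mysql \
      -e MYSQL_ROOT_PASSWORD=secret \
      mysql:5.6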

2) Multi-Container Applications

Many modern applications consist of several processes or require other applications. For example, my project at work consists of a Rails web app, a delayed_job worker process, an Apache Solr instance, and a MySQL database.

Since Docker strongly recommends a one-process-per-container configuration, you need a way to coordinate a set of running processes and make sure they can communicate with one another. Docker Compose does this, allowing you to easily specify whether containers should be able to open connections to each other’s network ports, share filesystems, etc.
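
Here’s a rough, hypothetical sketch of what that looks like for a setup like mine; the service names, images, and commands are illustrative, and the exact file syntax depends on the Compose version you’re running.

    # docker-compose.yml (illustrative)
    web:
      build: .
      command: bundle exec rails server -b 0.0.0.0
      ports:
        - "3000:3000"
      links:
        - db
        - solr
    worker:
      build: .
      command: bundle exec rake jobs:work
      links:
        - db
    db:
      image: mysql:5.6
      environment:
        MYSQL_ROOT_PASSWORD: secret
    solr:
      image: solr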

Currently, Docker Compose is not yet considered “production ready.” While it addresses the need to orchestrate processes, there is also the problem of monitoring and restarting processes as needed. It’s not clear to me yet what the best tool is for doing this (it may even be a completely orthogonal concern to what Docker does).

3) Running Commands

Sometimes you need to use the Rails CLI to do things like run database migrations or a rake task. Running commands takes a bit of extra effort, since they need to happen in a container. A slight complication is whether to run the command in the existing Rails container or to start another one entirely for that new process. It’s a bit of typing, which is annoying.
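
For example, assuming a Compose setup like the sketch above, running a migration looks something like this (the container and service names are illustrative):

    # Run the migration inside the already-running web container...
    docker exec -it myapp_web_1 bundle exec rake db:migrate

    # ...or spin up a one-off container just for that command
    docker-compose run --rm web bundle exec rake db:migrate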

The Payoffs

How is Docker life changing? There are several scenarios I encounter ALL THE TIME that Docker helps tremendously with.

* On the same machine, you can run several web applications, which each require different versions of, say, Ruby, Rails, MySQL and Apache, without dealing with software conflicts and incompatibilities.

* Related to the previous point, Docker lets you more easily experiment with different software without polluting your base system with a lot of crap you might never use again.

* There is MUCH less overhead and less wasted memory usage than with virtualization. If you allocate 1GB of RAM to a virtual machine but only use 512MB, the other half goes to waste. Docker containers only use as much memory as the processes themselves take up, plus a small bit of overhead. Since Docker uses unionfs to “copy” (really, overlay) images to container filesystems, the actual disk space used isn’t as much as you might think.

* Since Docker containers are entirely self-contained, they can be deployed in development, staging, and production environments with almost NO differences among them, thereby minimizing the problems that usually arise.

For me, a lot of the benefits boil down to this: virtualization is amazing, but in practice, I don’t use it that much because it’s too heavyweight and too coarse for my needs. A VirtualBox VM is a whole other machine that I need to think about. By working at the level of Linux processes, Docker is exactly the right kind of tool for managing application dependencies.

A cautionary note: there’s a lot of buzz right now around containers, including efforts at defining vendor-neutral standards, such as appc. Although Docker releases have been rapid and it is already seeing a lot of adoption, it feels bleeding edge to me. It’s exciting but in a few years, it’s entirely possible that another container solution might surpass it to become the de facto standard. The playing field is just too new, which means Docker comes with some risk. But it’s well worth exploring at this early stage, even if only to get a taste of the new ideas shaping systems configuration and deployment that are definitely here to stay.