Category Archives: java

True Empowerment

I fixed a bug in the blacklight-marc gem recently. It involved this line of Ruby code:

vals << (v == 'AD') ? 'Atlas' : 'Map'

Contrary to what it looks like, this line adds a boolean value to the vals array. The << operation returns the array itself, which is always truthy, so the ternary always evaluates to ‘Atlas’. Then nothing happens with that string.

Obviously, this isn’t what was intended. The problem is that << has higher precedence than the ternary (? :) operator. So here’s the fix:

vals << (v == 'AD' ? 'Atlas' : 'Map')

This code path wasn’t being taken all the time, and it also didn’t raise any exceptions: the calling code uses the result as an array of strings, so the booleans get automatically converted to “true” and “false” strings. I just happened to notice those weird values where they didn’t make sense, and thought to dig into it.
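To make the difference concrete, here is a minimal sketch that runs both versions of the line side by side. The method names collect_buggy and collect_fixed and the sample codes are made up for illustration; this is not the actual blacklight-marc code.

```ruby
# Hypothetical helpers for illustration only; not taken from blacklight-marc.

# Buggy: Ruby parses the line as (vals << (v == 'AD')) ? 'Atlas' : 'Map',
# so the comparison result is appended and the ternary's string is discarded.
def collect_buggy(codes)
  vals = []
  codes.each do |v|
    vals << (v == 'AD') ? 'Atlas' : 'Map'
  end
  vals
end

# Fixed: the parentheses wrap the whole ternary, so a string is appended.
def collect_fixed(codes)
  vals = []
  codes.each do |v|
    vals << (v == 'AD' ? 'Atlas' : 'Map')
  end
  vals
end

p collect_buggy(%w[AD XX])  # => [true, false]
p collect_fixed(%w[AD XX])  # => ["Atlas", "Map"]
```

Neither version raises an error, which is why the bug only shows up later as odd values in the output.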

Let’s be honest: this is the kind of mistake anyone could easily make. I’m 100% certain I’ve done something similar. In fact, I innocently asked some co-workers what the original line of code did, and of course, they interpreted it incorrectly. It’s a tricky little bug.

I thought to post about this because it’s a perfect example of how, in a loosely typed, dynamic language like Ruby, you’re really on your own.

Dynamic languages can often feel “empowering” because they place trust in the programmer. It’s your responsibility not to write code that does anything really crazy or stupid. But there are a lot of these “gotcha” cases, where you’re writing code that’s quite reasonable, and you simply made a mistake that the language lets you get away with, because it’s interpreted differently from what you intended. It’s valid code. And you won’t figure it out until much later, when it shows up as a symptom elsewhere.

By contrast, with Java or Scala, you wouldn’t be able to do this. The compiler would check the types and say, meaningfully, “Sorry buddy, it doesn’t make sense to me to add a boolean to a List of Strings,” and you’d immediately notice the problem with operator precedence. And you’d fix it.

Your program would never even be able to run with that error in it. Which is some awfully nice work that the language is doing for you there. That feels like true empowerment to me.

Final note: you could argue that good test coverage would catch this. That’s true, but we all know the difficulties of achieving thorough test coverage under deadlines. And this example is particularly annoying to get thorough coverage for, because the line of code is one case of many different cases of values for the variable ‘v’.

Announcing conciliator

I’ve just created a github repository for conciliator, a growing collection of OpenRefine reconciliation services, as well as a Java framework for creating them.

conciliator is a major refactoring of my refine_viaf project and supersedes it. This new project cleanly separates the VIAF-specific parts from the more “boilerplate” pieces needed for any OpenRefine reconciliation service. The result is a framework that allows you to easily write new reconciliation services. My intent here is to make some existing code way more flexible, so that it might be useful to more users and have a longer lifespan. My hosted service has already been running conciliator for a week now; if you’ve been using it, you don’t need to make any changes in OpenRefine.

Currently, conciliator out-of-the-box can query VIAF exactly like refine_viaf does, down to the same URLs. Additionally, conciliator can now query ORCID names. This was a somewhat arbitrary choice; I’ve been doing some ORCID integration at work so it was convenient for me to implement a data source for it as a proof of concept.

With VIAF and ORCID, conciliator acts as an intermediate or “bridge” service, but it would be possible to use conciliator to query other types of data sources as well: files, SQL databases, etc. Right now, you’d have to write your own code to read and parse files, open database connections, etc. But in the future, I hope to add support for these options to make them easier to implement.

For details on how to write your own service in Java using conciliator, see the README.

Are there data sources you’d like to see available as a reconciliation service? Leave a comment on this post. No promises, but I’ll at least consider all requests. And if you write your own service for a data source, please consider submitting your code as a pull request so that others can use it too!

The Myth of Artisanal Programming

Paul Chiusano, the author of the excellent Functional Programming in Scala from Manning (one of the few tech publishers I buy from; worth every penny), recently wrote a blog post titled, “The advantages of static typing, simply stated”.

Lately all I seem to do is rant to people about this exact topic. Paul’s post is way more succinct than anything I can write, so go over there and read it.

While he takes pains to give a balanced treatment of static vs dynamic type systems, it seems much more cut and dried to me. Dynamic languages are easier and faster for development when you’re getting started on a project, and it’s great if that project never gets very big. But they scale very poorly, for all the reasons he describes. Recently, I had the daunting task of reading almost 10k lines of Perl code (pretty good Perl, in my opinion). It was hard to make sense of and figure out how to modify and extend, whereas the MUCH larger Java codebase (over 100k lines, if I recall) that I worked with years ago felt very manageable.

My own history as a programmer matches Paul’s very closely. I started with Java, which was annoying but not a bad language by any means. Then Python came along and seemed like a liberation from Java’s rigidity and verbosity. But Python, Ruby and others are showing their weaknesses, and it’s no mystery why people are turning to the newer generation of statically typed languages like Scala, Haskell, Go, etc.

People who haven’t been around as long don’t necessarily have this perspective.

In retrospect, it’s interesting to me how we programmers “got sold” on dynamic languages, from a cultural perspective. You might recall that a big selling point was using simple text editors rather than IDEs, and there was this sense that writing code this way made you closer to the software somehow. Java was corporate, while Python was hand-crafted. There was a vague implicit notion of “artisanal” programming in these circles.

The upshot, of course, is that every time you read a chunk of code or call a function or method, your brain has to do a lot of the work that a statically typed language would be able to enforce and verify for you. But in a dynamic language, you won’t know what happens until the code runs. In large measure, the quality of software hinges on how much you can tell, a priori, about code before it runs at all. In a dynamic world, anything can happen, and often does.

This is a nightmare, pure and simple. Much of the strong focus on writing automated tests is to basically make up for the lack of static typing.

True artisanship lies in design: namely, thinking hard about the data structures and code organization you’re committing to. It’s not about being able to take liberties that can result in things that make no sense to the machine and that can cause errors at runtime that could have been caught beforehand.

Algorithms I: Notes in Week 5

Scattered thoughts:

A course on learning a programming language will help answer the question, “how do I do X?” The fun thing about an algorithms course is that the question is “how do I do X within certain parameters of time and space?”

In the real world, the two questions are actually one and the same. I’ve just come away from a project that had serious scalability problems, because many of its features could handle only very small sets of data used in development; when the app was run against live data, things stopped working because they would hit a timeout limit or processes would run out of memory.

I’m learning quickly that I can often intuit the “shape” of how an algorithm will perform, and I now have better language for describing this, but I’m not so good at calculating precisely the order of growth for even slightly complex code. It’s hard!

One paranoia-inducing aspect of programming assignments: for week 4’s assignment, a single timing test (1 out of 17) failed for my code because it took too long to finish. It’s hard to figure out… does this single failure expose a flaw in my overall implementation (if so, why did the other 16 pass)? Or was this last test thrown in as a “bonus” involving a difficult set of inputs that would require further optimization if you wanted to get full points? This is a tricky thing to assess as a student, and something only a human being would be able to tell you.

Trees are truly magical. I feel like I’ve barely started to grasp their many applications.

Algorithms I on Coursera

I’m currently taking the “Algorithms I” course on Coursera, a session of which started on January 22nd. I thought I’d write up my impressions so far on taking my first MOOC.

As someone who taught at a university for seven years in the humanities, I should say right off the bat that I dislike the idea of online learning for the reasons you might expect. But this course appealed to me for a few reasons. It’s developed and taught by Robert Sedgewick and Kevin Wayne, the authors of the highly regarded Algorithms, 4th Edition book. The syllabi of the two-course sequence on Coursera would make for the type of semester-length course you’d find in a respectable Computer Science department. Finally, Coursera has a reputation for offering more rigorous and demanding courses than other similar MOOC sites.

So far, I’m keeping up with the schedule and am in the middle of the Week 2 material. I’ve found it to be a positive experience so far, and more challenging than I’d expected!

Some initial impressions:

  • The course is a serious time commitment. Per week, it’s 2 hours of lecture + 2 hours for exercises + 4-12 hours for the programming assignment. I’ve chosen to skip the “interview questions” supplementary material.
  • Assignment grading is, thus far, very rigorous. Submitted source code is analyzed and run through a battery of tests measuring not only correctness, but code cleanliness, run times, and memory use, and scored accordingly.
  • The ability to submit exercises and assignments as many times as you like in order to improve your score is a fantastic feature. (I don’t know if all Coursera courses work this way.) It means you can really learn from your mistakes by correcting them; also, it gives you the chance to try out alternative solutions. This is WAY better than the traditional one-shot-only model of graded assignments, which is terrible for actual learning.
  • Basing the course on a published textbook (which is optional) is extremely helpful. There’s material covered more deeply in the text than in the lectures, but the lectures also address some aspects of topics and problems not covered in the book. This makes for a strong complementary relationship between the two; it doesn’t feel like the lectures are simply repeating the textbook.
  • You’re firmly expected to have some basic programming skills and a bit of math as a prerequisite. I like that the lectures keep the focus on the topics at hand, and don’t try to make the course all things to all people. If students need to “catch up” because they’re new to Java or their math is rusty, they can use the discussion forums to do so.

As for the actual material, I’ve already learned a lot so far:

  • I’ve gotten some exposure to formal methods for algorithm analysis. A week and a half obviously isn’t going to make anyone great at this, but at least I now have some approaches for thinking through correctness, run times, and memory use mathematically, whereas before, I would mostly work empirically.
  • I can better identify different orders of growth and some of the common code patterns that indicate them.
  • The first week’s case study of different algorithms for Union-Find was, for me, a thought-provoking exercise in what is possible with arrays vs trees in representing relationships among data. The programming assignment is so stringent that it’s difficult to satisfy all the run time and memory requirements for a perfect score. This has generated a lot of insightful discussion in the forums about optimization.
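For readers who haven’t seen Union-Find, here is a rough sketch of the tree-based approach (weighted quick-union with path halving), where a plain array of parent links represents a forest of trees. The course works in Java; this version is in Ruby, the class and method names are my own, and it is not a solution to the programming assignment.

```ruby
# Weighted quick-union with path halving.
# An array of parent links represents a forest; two sites are connected
# when they share the same root.
class UnionFind
  def initialize(n)
    @parent = Array.new(n) { |i| i }  # every site starts as its own root
    @size = Array.new(n, 1)           # number of sites in the tree rooted at i
  end

  # Follow parent links to the root, halving the path along the way.
  def find(p)
    while p != @parent[p]
      @parent[p] = @parent[@parent[p]]
      p = @parent[p]
    end
    p
  end

  def connected?(p, q)
    find(p) == find(q)
  end

  # Link the root of the smaller tree under the root of the larger one,
  # which keeps the trees shallow.
  def union(p, q)
    root_p = find(p)
    root_q = find(q)
    return if root_p == root_q

    root_p, root_q = root_q, root_p if @size[root_p] < @size[root_q]
    @parent[root_q] = root_p
    @size[root_p] += @size[root_q]
  end
end

uf = UnionFind.new(10)
uf.union(4, 3)
uf.union(3, 8)
uf.union(6, 5)
puts uf.connected?(4, 8)  # true
puts uf.connected?(4, 5)  # false
```

The array-only alternative (quick-find) stores a component id per site, which makes lookups trivial but forces every union to scan the whole array; the tree version trades that for short root-chasing loops.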

Algorithms really get at the essence of what programming is. Anyone who works as a programmer has to put into practice algorithmic thinking to some degree, even if they aren’t aware of it.

I plan to continue writing about this as a way to keep me accountable for completing the two-course sequence.

A Major Update to refine_viaf

I’ve rewritten my refine_viaf project in Java. It’s what is now running. The old python code is considered deprecated and will no longer be maintained, but will remain available in the python-deprecated branch on github.

The only thing most users need to know is that refine_viaf should return better results now. For the curious, this post explains the subtle but important differences in the new version and some reasons for the rewrite.


In a nutshell, the main difference/improvement is that searches now behave more like the VIAF website.

This is due mainly to how sources (e.g. “LC” for Library of Congress) are handled. Previously, either the source specified on the URL or the “preferred source” from the config file was used to filter out search results, but it did NOT get passed into the actual VIAF search query. This could give you some weird results. The new version works like VIAF’s website: if you don’t specify a source, everything gets searched; if you do specify one, it DOES get passed to the VIAF search query. Simple.

The old version had weird rules for which name in each VIAF “cluster” result it actually displayed. In the new version, if you don’t specify a source, the most popular name (i.e. the name used by the most sources) for a search result is used for display. If you specify a source, then its name is always used.
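The display rule can be sketched roughly as follows. This is not the project’s actual code (which is Java): the NameEntry struct, the display_name method, and the sample data are all made up for illustration, and falling back to the most popular name when the requested source has no name form is my own assumption.

```ruby
# Hypothetical shape for one VIAF cluster: each name form lists the sources using it.
NameEntry = Struct.new(:name, :sources)

# Pick the name to display for a cluster.
# With a source: use that source's name form.
# Without one: use the name form shared by the most sources.
def display_name(entries, source = nil)
  if source
    match = entries.find { |e| e.sources.include?(source) }
    return match.name if match
    # Assumption: fall through to the most popular name if the source is absent.
  end
  entries.max_by { |e| e.sources.size }&.name
end

# Made-up sample data, not real VIAF output.
cluster = [
  NameEntry.new('Smith, Jane, 1900-1980', %w[LC BNF DNB]),
  NameEntry.new('Smith, J.', %w[NLA])
]

puts display_name(cluster)         # Smith, Jane, 1900-1980
puts display_name(cluster, 'NLA')  # Smith, J.
```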

The old version supported a comma-separated list of sources at the end of the URL path. In the new version, only a single source is supported, since that’s what VIAF’s API accepts.

Lastly, the licenses are different: the python version was distributed under a BSD license. The new version is GNU GPL.

Other reasons for the rewrite

The changes above could have been implemented in python. I decided to rewrite it in Java for a few reasons:

– Overall performance is better in Java. The Django app used BeautifulSoup because VIAF’s XML used to be janky, but it appears this is no longer the case; Java’s SAX parser works great with their XML these days and is very fast. BeautifulSoup would leak memory and consume a lot of CPU, to the point where it would trigger automated warnings from my VPS provider. My server is very modest and needs to run other things, so these were real problems. Running the service as a single multi-threaded Java process keeps memory usage low and predictable, and it never spikes the CPU.

– Running a Java jar file is MUCH easier for people who want to run their own service, especially on Windows. With the python version, you had to install pip, install a bunch of packages, and create and configure a Django app, all of which put the software out of reach of many users who might want to run it.

– I don’t care what other people think: I like Java. Plus I wanted to experiment with Spring Boot. There are much leaner web frameworks I could have used to save some memory, but it was interesting to play with Spring.

Leave a comment!

If you use this thing, please take a second and leave a comment on this post. I’m interested to know how many people really run this on their own computers.


Looking at Go

In the latest stage of my exploration/deepening of programming knowledge, I’ve been looking at Go.

There’s got to be something that piques my intellectual curiosity or solves a specific problem for me to want to learn a new language. Not much about the latest “hot” languages like Ruby, Scala, and Erlang appeals to me, so I haven’t bothered with them. In real world work, I like Python as a general purpose language, and I like Java (seriously!) for large projects that need the strong tooling and frameworks available for it. Lisp and Clojure have provided useful perspective and food for thought, but in practice, they haven’t found a place in the real world software I write. Everything else I tolerate only because I have to (I’m looking at you, JavaScript).

Go is extremely intriguing. It strikes me as combining some of the best things about Python and Java. It would be great not to have to choose! I like the simple syntax (not as simple as Python, alas!), the static typing, the fact that it’s compiled, and the general philosophy of favoring composition over inheritance, an idea I’ve come to support more and more. In a world currently dominated by highly dynamic, interpreted languages with very loose typing systems and a hierarchical object oriented paradigm, Go is incredibly unique! Following the trend of languages like Clojure, Go has concurrency features that take strong advantage of multicore computing, except that its concurrency mechanisms seem much simpler. I’ve started to look at code samples and play with it a bit, and I really like what I see so far.

There are actually a lot of negative discussions of Go on the web, but most of them are about the language in its messy pre-1.0 state. The March 1.0 release has supposedly tightened up a lot of things, and of course, performance will only get better, now that the fundamental semantics and features are solidly in place. This is an exciting time for what feels like the next evolutionary step in programming languages.

Composition and Inheritance

For as long as I can remember, writing any kind of non-trivial software meant you needed to use object-oriented programming. It was a no-brainer. So I learned all the fundamentals of OOP, and design patterns as well, since one couldn’t get very far in Java without knowing the most common patterns.

I think taking OOP for granted as the only natural way to manage complexity is why learning Lisp is so mind-blowing for programmers like myself. Take, for example, polymorphism. I didn’t know that there was anything besides subtype polymorphism with single dispatch–and I didn’t know it was called that; I knew it only as polymorphism, plain and simple. The ability of Lisp to do multiple dispatch was incredibly eye-opening.

I think this is the sort of thing people mean when they go on about how Lisp has broadened their horizons and deepened their understanding of concepts.

To mention another example, in a bit more detail: as an experiment in some recent python code for work, I’ve been using fewer classes/objects and more functions. (It’s debatable just how much actual “functional programming” mileage you can get out of Python, but I’ll put that aside for now.) In such an approach, you inevitably end up using a lot of composition, rather than object inheritance, to build higher level abstractions. And that’s been working out very well so far.

Composition is a powerful thing because you can control the granularity of code reuse. With a carefully constructed library of functions, you can choose to call the functions at the appropriate level of abstraction you need, and even mix and match. That’s much harder to do with object inheritance, where classes force you into an all-or-nothing package deal–if you want only some of the functionality, you need to instantiate the whole object anyway. And if you want to selectively override functionality, you need to subclass, which effectively ties the parent class to its subtree, making it harder to modify in the future.
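Here is a small sketch of what I mean by controlling granularity. It uses Ruby lambdas rather than the Python from my work code, the TextOps module and its names are invented for the example, and Proc#>> requires Ruby 2.6 or later.

```ruby
# Hypothetical function library for illustration only.
module TextOps
  # Small, independent building blocks.
  STRIP    = ->(s) { s.strip }
  SQUEEZE  = ->(s) { s.gsub(/\s+/, ' ') }
  DOWNCASE = ->(s) { s.downcase }

  # Higher-level abstractions built by composing the pieces.
  # (f >> g) returns a function that applies f first, then g.
  TIDY      = STRIP >> SQUEEZE
  NORMALIZE = TIDY >> DOWNCASE
end

# Callers pick whichever level of abstraction they need.
puts TextOps::NORMALIZE.call("  Hello   World ")  # hello world
puts TextOps::TIDY.call("  Hello   World ")       # Hello World
puts TextOps::SQUEEZE.call("a   b")               # a b
```

With an inheritance-based design, getting only the whitespace handling would typically mean instantiating or subclassing whatever class bundled all three behaviors together.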

I’ve been thinking lately that objects are useful mostly to facilitate data abstraction: there’s no reason not to group together related accessors and mutators. But when I consider more complex bundles of functionality, I think twice before creating a class hierarchy and see if I can do it using functional composition instead.

Bigger! Faster! Stronger! 3 GB in the 2.0 GHz MacBook

The official Apple specs say that the 2.0 GHz MacBook can take up to 2 GB of memory. There’s a bit of information on the web–like this forum posting, for example–that says you can go up to 3 GB. The system board can address slightly more than 3 GB, so the official 2 GB limit is reportedly lower than what the hardware is capable of. By chance, I noticed a 2 GB chip for a reasonable $40 on my local Craigslist, so I decided to see for myself whether the stories were true.

It seems they are! I’ve been running the computer for a little over a week now with two chips: a 1 GB module, which had been in there before, and the new 2 GB module. It’s been put through its paces: on a given work day, I run Eclipse, Firefox, Thunderbird, Colloquy (an IRC client), Adium (instant messaging client), Skype (a VoIP client), a Java application server in development mode, a MySQL server, and Emacs, all at once.

The swap size has still been high, typically around 500 MB, but there are no longer the delays that I used to experience with 2 GB of memory when I had a lot of apps open and switched between them. So far, there have been no problems with stability.

Here’s a screenshot from System Profiler:

Note that Apple released two different models with the 2.0 GHz Core 2 Duo processor. You can find your machine’s model in System Profiler on the Hardware Overview screen. The 2 GB official limit applies to “MacBook2,1.” The later version, “MacBook3,1,” can take up to 4 GB.

With the additional memory, this machine will hopefully last me another two years.

Eclipse and JDK 1.6.0_05 on Mac OS

Last week, Java 1.6 went out of Developer Preview and became an “official” release for Mac OS 10.5.2. (You still can’t get 1.6 for 10.5.1, sadly.) I’ve been fiddling with 1.6 and Eclipse, trying to get them to play well together, and here’s what I’ve found so far.

Eclipse itself needs to run on 1.5. There’s a great blog post, “Running Eclipse on MacBooks with Java 6”, written by one “rkischuk,” that explains why: 1.6 doesn’t support 32-bit SWT-Cocoa bindings, so Eclipse will bomb. The error I got was a mysterious “JVM Terminated. Exit code=-1” and a list of run-time options. If you run Eclipse from a shell, you might see this:

2008-05-09 10:53:55.443 eclipse[257:10b] Cannot find executable for CFBundle 0x116030 (not loaded)

or maybe this:

_NSJVMLoadLibrary: NSAddLibrary failed for /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Libraries/libjvm.dylib
JavaVM FATAL: Failed to load the jvm library.

When I installed 1.6, I had messed around with /System/Library/Frameworks/JavaVM.framework/Versions, trying to get 1.6 to run as the system default. But the cleanest solution for me was to KEEP 1.5 as the default. So make sure that directory looks like this:

drwxr-xr-x 11 root wheel 374 May 9 10:49 ..
lrwxr-xr-x 1 root wheel 5 May 5 22:41 1.3 -> 1.3.1
drwxr-xr-x 3 root wheel 102 Nov 2 2007 1.3.1
lrwxr-xr-x 1 root wheel 5 Apr 18 13:07 1.4 -> 1.4.2
lrwxr-xr-x 1 root wheel 3 May 5 22:41 1.4.1 -> 1.4
drwxr-xr-x 8 root wheel 272 Apr 27 2007 1.4.2
lrwxr-xr-x 1 root wheel 5 Apr 18 13:07 1.5 -> 1.5.0
drwxr-xr-x 8 root wheel 272 Apr 27 2007 1.5.0
lrwxr-xr-x 1 root wheel 5 May 5 22:41 1.6 -> 1.6.0
drwxr-xr-x 8 root wheel 272 Apr 18 14:03 1.6.0
drwxr-xr-x 9 root wheel 306 May 9 10:50 A
lrwxr-xr-x 1 root wheel 1 May 9 11:13 Current -> A
lrwxr-xr-x 1 root wheel 3 May 9 11:12 CurrentJDK -> 1.5

Eclipse should run as it normally does.

If your code project(s) don’t require SWT, you can use 1.6 as an Installed JRE within Eclipse. Go to Preferences -> Java -> Installed JREs -> Add…. Select “Mac OS VM” and point it to:


Building projects and running JBoss using 1.6 seems to work just fine.