A Major Update to refine_viaf

I’ve rewritten my refine_viaf project in Java. It’s what refine.codefork.com is now running. The old Python code is deprecated and will no longer be maintained, but it will remain available in the python-deprecated branch on GitHub.

The only thing most users need to know is that refine_viaf should return better results now. For the curious, this post explains the subtle but important differences in the new version and some reasons for the rewrite.

Differences

In a nutshell, the main improvement is that searches now behave more like the VIAF website.

This is due mainly to how sources (e.g. “LC” for Library of Congress) are handled. Previously, either the source specified in the URL or the “preferred source” from the config file was used to filter the search results after the fact, but it did NOT get passed into the actual VIAF search query. This could give you some weird results. The new version works like VIAF’s website: if you don’t specify a source, everything gets searched; if you do specify one, it DOES get passed to the VIAF search query. Simple.
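To make the new behavior concrete, here is a minimal sketch of building a search query where the source, when given, becomes part of the query itself instead of being applied as a filter afterwards. This is not the actual refine_viaf code: the class name, the method, and especially the index names `local.names` and `local.sources` are assumptions for illustration and may not match VIAF's real query syntax.

```java
final class ViafQuerySketch {

    // Assumed index names, for illustration only; check VIAF's documentation
    // for the real ones before relying on them.
    private static final String NAME_INDEX = "local.names";
    private static final String SOURCE_INDEX = "local.sources";

    /**
     * Builds a CQL-style query string.
     * Pass null (or an empty string) as the source to search everything.
     */
    static String buildQuery(String name, String source) {
        // Escape embedded double quotes so the name stays one quoted term.
        String quoted = "\"" + name.replace("\"", "\\\"") + "\"";
        StringBuilder query = new StringBuilder(NAME_INDEX + " all " + quoted);
        if (source != null && !source.isEmpty()) {
            // The source is part of the query, not a post-search filter.
            query.append(" and ").append(SOURCE_INDEX)
                 .append(" = \"").append(source).append("\"");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("Twain, Mark", null));
        System.out.println(buildQuery("Twain, Mark", "lc"));
    }
}
```

With no source, the query contains only the name clause; with a source, a second clause is appended, so the search itself is restricted rather than its results.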

The old version had weird rules for which name in each VIAF “cluster” result it actually displayed. In the new version, if you don’t specify a source, the most popular name (i.e. the name used by the most sources) for a search result is used for display. If you specify a source, then its name is always used.
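The display rule can be sketched roughly as follows. The data model here (a map from each name form in a cluster to the set of source codes that use it) is hypothetical, as are the class and method names. The fallback to the most popular name when the requested source has no name in the cluster is also an assumption; the post does not say what happens in that case.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class DisplayNameSketch {

    /**
     * Picks the name to display for one cluster.
     * namesToSources: hypothetical model, name form -> source codes using it.
     * source: the requested source code, or null if none was specified.
     */
    static Optional<String> displayName(Map<String, Set<String>> namesToSources,
                                        String source) {
        if (source != null) {
            // A source was specified: use that source's name form.
            for (Map.Entry<String, Set<String>> e : namesToSources.entrySet()) {
                if (e.getValue().contains(source)) {
                    return Optional.of(e.getKey());
                }
            }
            // Assumption: fall through to the most popular name if not found.
        }
        // No source: use the name form shared by the most sources.
        // On a tie, the first one in iteration order wins.
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Set<String>> e : namesToSources.entrySet()) {
            if (e.getValue().size() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue().size();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Made-up cluster data, for illustration only.
        Map<String, Set<String>> cluster = new LinkedHashMap<>();
        cluster.put("Form A", Set.of("LC", "DNB", "BNF"));
        cluster.put("Form B", Set.of("NLA"));
        System.out.println(displayName(cluster, null));
        System.out.println(displayName(cluster, "NLA"));
    }
}
```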

The old version supported a comma-separated list of sources at the end of the URL path. In the new version, only a single source is supported, since that’s what VIAF’s API accepts.

Lastly, the licenses are different: the Python version was distributed under a BSD license. The new version is GNU GPL.

Other reasons for the rewrite

The changes above could have been implemented in Python. I decided to rewrite it in Java for a few reasons:

– Overall performance is better in Java. The Django app used BeautifulSoup because VIAF’s XML used to be janky, but it appears this is no longer the case; Java’s SAX parser works great with their XML these days and is very fast. BeautifulSoup would leak memory and consume a lot of CPU, to the point where it would trigger automated warnings from my VPS provider. My server is very modest and needs to run other things, so these were real problems. Running the service as a single multi-threaded Java process keeps memory usage low and predictable, and it never spikes the CPU.

– Running a Java jar file is MUCH easier for people who want to run their own service, especially on Windows. With the Python version, you had to install pip, install a bunch of packages, and create and configure a Django app, all of which put the software out of reach of many users who might want to run it.

– I don’t care what other people think: I like Java. Plus I wanted to experiment with Spring Boot. There are much leaner web frameworks I could have used to save some memory, but it was interesting to play with Spring.
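For readers unfamiliar with the SAX parsing mentioned in the first point above, here is a minimal, self-contained sketch using the parser that ships with the JDK. It streams through the document and collects the text of every `<id>` element without building a tree in memory. The XML layout and the element name `id` are hypothetical and are not VIAF's actual schema; this is also not the parsing code refine_viaf uses.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class SaxSketch {

    /** Returns the trimmed text of every <id> element (hypothetical name). */
    static List<String> extractIds(String xml) {
        List<String> ids = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside an <id> element.
            private StringBuilder text;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("id".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("id".equals(qName) && text != null) {
                    ids.add(text.toString().trim());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception e) {
            // Wrap checked parser/IO exceptions to keep the sketch short.
            throw new IllegalStateException("Could not parse XML", e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<results><cluster><id>123</id></cluster>"
                   + "<cluster><id>456</id></cluster></results>";
        System.out.println(extractIds(xml));
    }
}
```

Because the handler only keeps the small pieces it cares about, memory use stays flat regardless of how large the response is.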

Leave a comment!

If you use this thing, please take a second and leave a comment on this post. I’m interested to know how many people really run this on their own computers.

Enjoy.

9 thoughts on “A Major Update to refine_viaf”

  1. Elliot Williams

    Hi Jeff. I’ve been playing around with and utilizing this reconciliation service for a few weeks now, and wanted to thank you for putting it together and making it available. It’s been really useful for me, so thanks! I used the hosted version a few times, and have started running it locally, which so far works great. One thing I wanted to mention is that when I specify a source to search (e.g. LC or BNF), it still seems to return the most popular name, rather than the version of the name from that source. Thanks again!
    -Elliot

  2. Jennifer

    Hi Jeff, I am really happy with the 12/15 updates! My results are much more accurate! I am so thrilled that you created this, and I plan to tell a bunch of music catalogers and librarians about your service next week in a presentation I am giving.

  3. jeff Post author

    Hi Jennifer,

    Thanks for taking a minute to leave a comment, I really appreciate it! Glad you are finding refine_viaf useful.

  4. Amanda

    Thank you enormously for this service! I am impressed with the accuracy of the results and blown away with how much time you just saved me. I will be singing the praises of your service in archives settings!

  5. jeff Post author

    I appreciate your taking the time to leave a comment here, Amanda! Glad it is useful to you.

  6. Shelley

    Just want to say that I’ve used this for a project and really found it useful. Thanks for writing it! One thing I found confusing was how to get IDs and labels from terms that were reconciled. When I set up a service to get names from LC, I find that what is returned in the GREL cell.recon.candidate[0].name is the LC label but cell.recon.candidate[0].id is the VIAF ID. I was able to get to the LC ID by fetching the VIAF URL justlinks.json and parsing the JSON in OpenRefine, but it was not straightforward. Maybe the LC ID was returned in the reconciliation service, but it wasn’t clear to me where it was stored and how to get it. It would be a great improvement to this service. Thanks again for writing it!

  7. Pingback: “Proxy mode” added in refine_viaf 1.4 | codefork.com
