I’ve rewritten my refine_viaf project in Java. It’s what refine.codefork.com is now running. The old Python code is deprecated and will no longer be maintained, but it remains available in the python-deprecated branch on GitHub.
The only thing most users need to know is that refine_viaf should return better results now. For the curious, this post explains the subtle but important differences in the new version and some reasons for the rewrite.
Differences
In a nutshell, the main difference/improvement is that searches now behave more like the VIAF website.
This is due mainly to how sources (e.g. “LC” for Library of Congress) are handled. Previously, either the source specified in the URL or the “preferred source” from the config file was used to filter the search results, but it did NOT get passed into the actual VIAF search query. This could give you some weird results. The new version works like VIAF’s website: if you don’t specify a source, everything gets searched; if you do specify one, it DOES get passed to the VIAF search query. Simple.
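For the curious, here is a rough sketch of that source handling in Java. It is not the actual refine_viaf code: the class and method names are made up for illustration, and the CQL index names (local.names, local.sources) are assumptions about how VIAF’s search query is phrased.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not the actual refine_viaf code. */
final class ViafQuerySketch {

    /**
     * Builds a CQL query string for a name search.
     * The index names local.names and local.sources are assumptions.
     */
    static String buildCql(String name, String source) {
        // Escape backslashes and double quotes so the name stays one quoted term.
        String escaped = name.replace("\\", "\\\\").replace("\"", "\\\"");
        String cql = "local.names all \"" + escaped + "\"";
        if (source == null || source.trim().isEmpty()) {
            // No source given: search everything.
            return cql;
        }
        // Source given: it becomes part of the query itself,
        // instead of being used to filter results afterwards.
        return cql + " and local.sources = \""
                + source.trim().toLowerCase(Locale.ROOT) + "\"";
    }
}
```

Calling buildCql("Mark Twain", null) produces a query across all sources, while buildCql("Mark Twain", "LC") adds the source clause to the same query.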
The old version had weird rules for which name in each VIAF “cluster” result it actually displayed. In the new version, if you don’t specify a source, the most popular name (i.e. the name used by the most sources) for a search result is used for display. If you specify a source, then its name is always used.
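The display rule above can be sketched roughly like this. Again, this is an illustration and not the real code: the class name, the method name, and the map-of-names-to-sources shape are all hypothetical, and ties are broken by taking the first name encountered, which is my own assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not the actual refine_viaf code. */
final class DisplayNameChooser {

    /**
     * Picks the name to display for one VIAF cluster.
     *
     * @param sourcesByName each name form mapped to the source codes that use it
     * @param source        requested source code, or null/empty for none
     */
    static Optional<String> choose(Map<String, List<String>> sourcesByName, String source) {
        if (source != null && !source.isEmpty()) {
            // A source was specified: always use that source's name.
            for (Map.Entry<String, List<String>> e : sourcesByName.entrySet()) {
                for (String s : e.getValue()) {
                    if (s.equalsIgnoreCase(source)) {
                        return Optional.of(e.getKey());
                    }
                }
            }
            return Optional.empty();
        }
        // No source: use the name shared by the most sources.
        // Ties go to the first name in iteration order (an assumption).
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, List<String>> e : sourcesByName.entrySet()) {
            if (e.getValue().size() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue().size();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

An Optional is returned so that a cluster with no name from the requested source is reported as empty rather than silently falling back to some other name.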
The old version supported a comma-separated list of sources at the end of the URL path. In the new version, only a single source is supported, since that’s what VIAF’s API accepts.
Lastly, the licenses are different: the Python version was distributed under a BSD license, while the new version is under the GNU GPL.
Other reasons for the rewrite
The changes above could have been implemented in Python. I decided to rewrite it in Java for a few reasons:
– Overall performance is better in Java. The Django app used BeautifulSoup because VIAF’s XML used to be janky, but it appears this is no longer the case; Java’s SAX parser works great with their XML these days and is very fast. BeautifulSoup would leak memory and consume a lot of CPU, to the point where it would trigger automated warnings from my VPS provider. My server is very modest and needs to run other things, so these were real problems. Running the service as a single multi-threaded Java process keeps memory usage low and predictable, and it never spikes the CPU.
– Running a Java jar file is MUCH easier for people who want to run their own service, especially on Windows. With the Python version, you had to install pip, install a bunch of packages, and create and configure a Django app, all of which put the software out of reach of many users who might want to run it.
– I don’t care what other people think: I like Java. Plus I wanted to experiment with Spring Boot. There are much leaner web frameworks I could have used to save some memory, but it was interesting to play with Spring.
Leave a comment!
If you use this thing, please take a second and leave a comment on this post. I’m interested to know how many people really run this on their own computers.
Enjoy.
Hi Jeff. I’ve been playing around with and utilizing this reconciliation service for a few weeks now, and wanted to thank you for putting it together and making it available. It’s been really useful for me, so thanks! I used the hosted version a few times, and have started running it locally, which so far works great. One thing I wanted to mention is that when I specify a source to search (e.g. LC or BNF), it still seems to return the most popular name, rather than the version of the name from that source. Thanks again!
-Elliot
Thanks for the comment! I really appreciate it. The problem with sources that you described was due to VIAF adding some fields to their XML, and the software not handling them very well. I posted an update that should fix the bug:
https://github.com/codeforkjeff/refine_viaf/releases
Hi Jeff, I am really happy with the 12/15 updates! My results are much more accurate! I am so thrilled that you created this, and I plan to tell a bunch of music catalogers and librarians about your service next week in a presentation I am giving.
Hi Jennifer,
Thanks for taking a minute to leave a comment, I really appreciate it! Glad you are finding refine_viaf useful.
Thank you enormously for this service! I am impressed with the accuracy of the results and blown away with how much time you just saved me. I will be singing the praises of your service in archives settings!
I appreciate your taking the time to leave a comment here, Amanda! Glad it is useful to you.
Just want to say that I’ve used this for a project and really found it useful. Thanks for writing it! One thing I found confusing was how to get IDs and labels from terms that were reconciled. When I set up a service to get names from LC, I find that what is returned in the GREL cell.recon.candidate[0].name is the LC label but cell.recon.candidate[0].id is the VIAF ID. I was able to get to the LC ID by fetching the VIAF URL justlinks.json and parsing the JSON in OpenRefine, but it was not straightforward. Maybe the LC ID was returned by the reconciliation service, but it wasn’t clear to me where it was stored and how to get it. It would be a great improvement to this service. Thanks again for writing it!
Hi Shelley,
Being able to get name IDs from LC and other source institutions is a wonderful suggestion, useful to others as well I’m sure. I’ve added a “proxy mode” feature in v1.4 to enable you to do this. See this post:
http://codefork.com/blog/index.php/2016/06/06/proxy-mode-added-in-refine_viaf-1-4/
Thanks for your feedback!