Functional Languages

Functional programming is a hot topic these days, one that I’ve become increasingly interested in over the last two years. One marker of its rising popularity is the publication of books on how to do FP in mainstream languages such as Python and JavaScript.

While it seems worthwhile to adopt FP concepts and get their benefits wherever you can, these approaches inevitably run up against the fact that you need certain language features in order for FP to work smoothly in practice. In other words, there is functional programming as a paradigm, and there are functional languages, which provide crucial features that make the paradigm actually work effectively in the real world.

I’ve been gradually understanding this better while learning Scala. If, like myself, you’re a newcomer to functional languages, here are three prominent features that illustrate what I mean by things that a full-fledged functional language provides.

1) Lists: This data structure, implemented as a linked list, is a staple of functional languages. And since you use it everywhere, it’s important to understand its performance characteristics: appending to a list takes linear time (proportional to n, the length of the list), whereas prepending takes constant time; retrieving the element at index i requires walking i nodes of the list; and so on.

Coming from other languages, it might seem weird that the linked list (rather than the fixed-length array) plays such a prominent role, but the reason is that it’s great for expressing iteration as recursion: you process the “head,” or first element of the list, and recursively call your function with the “tail,” the remaining elements. A linked list makes this head/tail pattern very fast, because it avoids copying the list on each recursive call: retrieving the list’s tail only requires following a pointer.
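
To make this concrete, here’s a minimal Scala sketch of the head/tail pattern (a hand-rolled sum, purely for illustration):

    // Summing a list with the head/tail pattern. Each recursive call
    // simply moves past the head; the list is never copied.
    def sum(xs: List[Int]): Int = xs match {
      case Nil          => 0                // base case: empty list
      case head :: tail => head + sum(tail) // process head, recurse on tail
    }

    sum(List(1, 2, 3)) // 6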

More broadly, because data is immutable in pure FP, a language needs persistent data structures: structures that share as much memory as possible between the old and new versions of a value, so that “modifying” one doesn’t require copying it wholesale. This isn’t a concern in the imperative paradigm, where mutating data in place is the norm.
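
For example, prepending to a Scala list builds a new list that shares all of the old list’s nodes instead of copying them (a small illustration of structural sharing):

    val xs = List(2, 3)
    val ys = 1 :: xs   // ys is List(1, 2, 3)
    // No copying happened: the tail of ys is xs itself, the very same nodes.
    ys.tail eq xs      // true (reference equality)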

2) Tail call optimization: Scala and other functional languages implement tail call optimization: if the last thing a function does is return a recursive call to itself, the compiler does not grow the call stack for each iteration. Since the result of the very last call will be the result of the first call in the recursive chain, the intermediate stack frames can safely be skipped.

This is a crucial optimization. A problem with deep recursion is that you run the risk of overflowing the call stack. Without this language feature, deeply recursive code will crash.
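
In Scala, you can even ask the compiler to verify that the optimization applies by using the @tailrec annotation. A minimal sketch (note that, unlike the naive sum earlier, the recursive call here is in tail position):

    import scala.annotation.tailrec

    // Tail-recursive sum with an accumulator: the recursive call is the
    // last thing the function does, so it compiles down to a loop.
    @tailrec
    def sumAll(xs: List[Int], acc: Long = 0L): Long = xs match {
      case Nil          => acc
      case head :: tail => sumAll(tail, acc + head) // tail position: no stack growth
    }

    sumAll((1 to 100000).toList) // 5000050000, with no risk of stack overflow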

3) Lazy evaluation: the ability to defer evaluating an expression until its value is actually needed lets you do some very powerful things, such as create infinite data structures and invent custom control structures. Built-in support for lazy evaluation means you don’t have to jump through hoops to accomplish these things.
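
A small Scala sketch of both ideas, assuming Scala 2.13 or later (where the lazily evaluated list type is called LazyList; older versions call it Stream):

    // An infinite data structure: the natural numbers from n upward.
    // Elements are computed only on demand.
    def from(n: Int): LazyList[Int] = n #:: from(n + 1)

    from(0).filter(_ % 2 == 0).take(5).toList // List(0, 2, 4, 6, 8)

    // A custom control structure, via a by-name (lazily evaluated) parameter.
    def unless(cond: Boolean)(body: => Unit): Unit =
      if (!cond) body

    unless(1 > 2) { println("this runs, because the condition is false") }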

If you look through Functional Python Programming (which is a very good book), it acknowledges that Python lacks the features above and offers some options for simulating them or working around them. To newcomers, such discussions and tricks can be a little confusing, because it’s hard to grasp why the lack of these features is a shortcoming in the first place.

This is why I think it’s best for people new to FP to approach Scala as a functional language first. When I initially tried to learn it through resources and books that aimed to cover the language and API comprehensively, it was simply bewildering. Learning Scala through the lens of FP concepts, and of the features needed to support those concepts, makes it much easier to understand why many of the seemingly odd constructs and idioms exist.

(As a footnote, this is exactly the approach taken by Martin Odersky’s Coursera course, “Functional Programming Principles in Scala”. I’m nearly done going through the lectures, and I think it’s a much better resource than any book I’ve perused.)

Algorithms I: Notes from Week 5

Scattered thoughts:

A course on learning a programming language helps answer the question, “How do I do X?” The fun thing about an algorithms course is that the question becomes “How do I do X within certain bounds of time and space?”

In the real world, the two questions are actually one and the same. I’ve just come away from a project that had serious scalability problems: many of its features could handle only the very small data sets used in development, and when the app was run against live data, things stopped working because requests would hit a timeout limit or processes would run out of memory.

I’m learning quickly that I can often intuit the “shape” of how an algorithm will perform, and I now have better language for describing this, but I’m not so good at precisely calculating the order of growth for even slightly complex code. It’s hard!

One paranoia-inducing aspect of the programming assignments: for week 4’s assignment, a single timing test (1 out of 17) failed for my code because it took too long to finish. It’s hard to know what to make of that. Does this single failure expose a flaw in my overall implementation (and if so, why did the other 16 pass)? Or was this last test thrown in as a “bonus” involving a difficult set of inputs that would require further optimization to earn full points? This is a tricky thing to assess as a student, and something only a human being would be able to tell you.

Trees are truly magical. I feel like I’ve barely started to grasp their many applications.

Algorithms I on Coursera

I’m currently taking the “Algorithms I” course on Coursera, a session of which started on January 22nd. I thought I’d write up my impressions so far on taking my first MOOC.

As someone who taught in the humanities at a university for seven years, I should say right off the bat that I dislike the idea of online learning, for the reasons you might expect. But this course appealed to me for a few reasons. It’s developed and taught by Robert Sedgewick and Kevin Wayne, the authors of the highly regarded Algorithms, 4th Edition. The syllabi of the two-course sequence on Coursera would make for the kind of semester-length course you’d find in a respectable Computer Science department. Finally, Coursera has a reputation for offering more rigorous and demanding courses than similar MOOC sites.

So far, I’m keeping up with the schedule and am in the middle of the Week 2 material. I’ve found it to be a positive experience so far, and more challenging than I’d expected!

Some initial impressions:

  • The course is a serious time commitment. Per week, it’s 2 hours of lecture + 2 hours for exercises + 4-12 hours for the programming assignment, or 8 to 16 hours in all. I’ve chosen to skip the “interview questions” supplementary material.
  • Assignment grading is, thus far, very rigorous. Submitted source code is analyzed and run through a battery of tests measuring not only correctness but also code cleanliness, run times, and memory use, and it is scored accordingly.
  • The ability to submit exercises and assignments as many times as you like in order to improve your score is a fantastic feature. (I don’t know if all Coursera courses work this way.) It means you can really learn from your mistakes by correcting them, and it gives you the chance to try out alternative solutions. This is WAY better than the traditional one-shot model of graded assignments, which is terrible for actual learning.
  • Basing the course on a published textbook (which is optional) is extremely helpful. There’s material covered more deeply in the text than in the lectures, but the lectures also address some aspects of topics and problems not covered in the book. This makes for a strong complementary relationship between the two; it doesn’t feel like the lectures are simply repeating the textbook.
  • You’re firmly expected to have some basic programming skills and a bit of math as prerequisites. I like that the lectures keep the focus on the topics at hand and don’t try to make the course all things to all people. Students who need to “catch up” because they’re new to Java or their math is rusty use the discussion forums to do so.

As for the actual material, I’ve already learned a lot so far:

  • I’ve gotten some exposure to formal methods for algorithm analysis. A week and a half obviously isn’t going to make anyone great at this, but at least I now have some approaches for thinking through correctness, run times, and memory use mathematically, whereas before, I would mostly work empirically.
  • I can better identify different orders of growth and some of the common code patterns that indicate them.
  • The first week’s case study of different algorithms for Union-Find was, for me, a thought-provoking exercise in what is possible with arrays vs. trees in representing relationships among data (a sketch of the tree-based approach follows this list). The programming assignment is so stringent that it’s difficult to satisfy all the run time and memory requirements for a perfect score. This has generated a lot of insightful discussion in the forums about optimization.
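
Here is that sketch: a minimal weighted quick-union in Scala (the course presents this material in Java; this version is my own illustration and omits path compression):

    // Weighted quick-union: each element points to a parent, and a root
    // represents its component. Linking smaller trees under larger ones
    // keeps trees shallow, so find runs in O(log n).
    class UnionFind(n: Int) {
      private val parent = Array.tabulate(n)(identity) // each node starts as its own root
      private val size   = Array.fill(n)(1)

      def find(p: Int): Int = {
        var i = p
        while (parent(i) != i) i = parent(i) // walk up to the root
        i
      }

      def connected(p: Int, q: Int): Boolean = find(p) == find(q)

      def union(p: Int, q: Int): Unit = {
        val rp = find(p)
        val rq = find(q)
        if (rp != rq) {
          // Attach the smaller tree under the root of the larger one.
          if (size(rp) < size(rq)) { parent(rp) = rq; size(rq) += size(rp) }
          else                     { parent(rq) = rp; size(rp) += size(rq) }
        }
      }
    }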

Algorithms really get at the essence of what programming is. Anyone who works as a programmer has to put into practice algorithmic thinking to some degree, even if they aren’t aware of it.

I plan to continue writing about this as a way to keep me accountable for completing the two-course sequence.

A Major Update to refine_viaf

I’ve rewritten my refine_viaf project in Java. It’s what refine.codefork.com is now running. The old Python code is deprecated and will no longer be maintained, but it will remain available in the python-deprecated branch on GitHub.

The only thing most users need to know is that refine_viaf should return better results now. For the curious, this post explains the subtle but important differences in the new version and some reasons for the rewrite.

Differences

In a nutshell, the main difference/improvement is that searches now behave more like the VIAF website.

This is due mainly to how sources (e.g. “LC” for the Library of Congress) are handled. Previously, either the source specified in the URL or the “preferred source” from the config file was used to filter search results after the fact, but it did NOT get passed into the actual VIAF search query. This could give you some weird results. The new version works like VIAF’s website: if you don’t specify a source, everything gets searched; if you do specify one, it DOES get passed into the VIAF search query. Simple.
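
To show the shape of that change, here’s a hypothetical sketch in Scala (the project itself is Java, and Result, viafSearch, and every other name below are made up for this post; this is not the actual code or the real VIAF API):

    // Stand-ins, purely for illustration.
    case class Result(name: String, sources: Set[String])
    def viafSearch(name: String, source: Option[String] = None): Seq[Result] = Seq.empty

    // Old behavior: the source never reached the query; results were
    // filtered locally afterward, which could give weird results.
    def oldSearch(name: String, source: Option[String]): Seq[Result] = {
      val results = viafSearch(name)
      source.fold(results)(src => results.filter(_.sources.contains(src)))
    }

    // New behavior: the source (if any) is part of the query itself,
    // so VIAF does the filtering, just as its own website does.
    def newSearch(name: String, source: Option[String]): Seq[Result] =
      viafSearch(name, source)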

The old version had weird rules for which name in each VIAF “cluster” result it actually displayed. In the new version, if you don’t specify a source, the most popular name (i.e. the name used by the most sources) in a search result is used for display. If you specify a source, that source’s name is always used.

The old version supported a comma-separated list of sources at the end of the URL path. In the new version, only a single source is supported, since that’s what VIAF’s API accepts.

Lastly, the licenses differ: the Python version was distributed under a BSD license, while the new version is licensed under the GNU GPL.

Other reasons for the rewrite

The changes above could have been implemented in python. I decided to rewrite it in Java for a few reasons:

– Overall performance is better in Java. The Django app used BeautifulSoup because VIAF’s XML used to be janky, but it appears this is no longer the case; Java’s SAX parser works great with their XML these days and is very fast. BeautifulSoup would leak memory and consume a lot of CPU, to the point where it would trigger automated warnings from my VPS provider. My server is very modest and needs to run other things, so these were real problems. Running the service as a single multi-threaded Java process keeps memory usage low and predictable, and it never spikes the CPU.

– Running a Java jar file is MUCH easier for people who want to run their own service, especially on Windows. With the python version, you had to install pip, install a bunch of packages, and create and configure a Django app, all of which put the software out of reach of many users who might want to run it.

– I don’t care what other people think: I like Java. Plus I wanted to experiment with Spring Boot. There are much leaner web frameworks I could have used to save some memory, but it was interesting to play with Spring.

Leave a comment!

If you use this thing, please take a second and leave a comment on this post. I’m interested to know how many people really run this on their own computers.

Enjoy.

A Year of Stats for refine.codefork.com

A year ago, I wrote an OpenRefine reconciliation service that queries VIAF and posted the source code on GitHub. I’ve also been hosting it publicly at http://refine.codefork.com for anyone to use.

Below are two charts showing a year’s worth of usage statistics for this service. The first chart counts web requests made. (The way OpenRefine works, a single web request contains up to 10 name reconciliation queries, so 200 web requests can translate to as many as 2,000 name reconciliations.) The second chart counts the unique hosts (i.e. unique computers, more or less) that used the service.

Last month, the busiest one yet, 47 different computers made an average of approximately 2,360 name reconciliations each!

This usage has certainly exceeded my expectations. Now and then, I get a very nice tweet or two from someone who’s used it, which is really gratifying, considering it was just a little side project I threw together. You just never know.