Algorithms I on Coursera

I’m currently taking the “Algorithms I” course on Coursera, a session of which started on January 22nd. I thought I’d write up my impressions so far on taking my first MOOC.

As someone who taught at a university for seven years in the humanities, I should say right off the bat that I dislike the idea of online learning for the reasons you might expect. But this course appealed to me for a few reasons. It’s developed and taught by Robert Sedgewick and Kevin Wayne, the authors of the highly regarded Algorithms, 4th Edition book. The syllabi of the two-course sequence on Coursera would make for the type of semester-length course you’d find in a respectable Computer Science department. Finally, Coursera has a reputation for offering more rigorous and demanding courses than other similar MOOC sites.

So far, I’m keeping up with the schedule and am in the middle of the Week 2 material. I’ve found it to be a positive experience so far, and more challenging than I’d expected!

Some initial impressions:

  • The course is a serious time commitment. Per week, it’s 2 hours of lecture + 2 hours for exercises + 4-12 hours for the programming assignment. I’ve chosen to skip the “interview questions” supplementary material.
  • Assignment grading is, thus far, very rigorous. Submitted source code is analyzed and run through a battery of tests measuring not only correctness, but code cleanliness, run times, and memory use, and scored accordingly.
  • The ability to submit exercises and assignments as many times as you like in order to improve your score is a fantastic feature. (I don’t know if all Coursera courses work this way.) It means you can really learn from your mistakes by correcting them, and it gives you the chance to try out alternative solutions. This is WAY better than the traditional one-shot-only model of graded assignments, which is terrible for actual learning.
  • Basing the course on a published textbook (which is optional) is extremely helpful. There’s material covered more deeply in the text than in the lectures, but the lectures also address some aspects of topics and problems not covered in the book. This makes for a strong complementary relationship between the two; it doesn’t feel like the lectures are simply repeating the textbook.
  • You’re firmly expected to have some basic programming skills and a bit of math as a prerequisite. I like that the lectures keep the focus on the topics at hand, and don’t try to make the course all things to all people. If students need to “catch up” because they’re new to Java or their math is rusty, they use the discussion forums to do so.

As for the actual material, I’ve already learned a lot so far:

  • I’ve gotten some exposure to formal methods for algorithm analysis. A week and a half obviously isn’t going to make anyone great at this, but at least I now have some approaches for thinking through correctness, run times, and memory use mathematically, whereas before, I would mostly work empirically.
  • I can better identify different orders of growth and some of the common code patterns that indicate them.
  • The first week’s case study of different algorithms for Union-Find was, for me, a thought-provoking exercise in what is possible with arrays vs trees in representing relationships among data. The programming assignment is so stringent that it’s difficult to satisfy all the run time and memory requirements for a perfect score. This has generated a lot of insightful discussion in the forums about optimization.
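The array-vs-tree tradeoff the case study explores can be made concrete with weighted quick-union plus path compression, the variant the course builds toward. This is my own sketch from the lecture material, not the assignment’s required API:

```java
// Weighted quick-union with path compression: the parent[] array encodes
// a forest, and each element's component is identified by its tree root.
public class UnionFind {
    private final int[] parent; // parent[i] = parent of i in its tree
    private final int[] size;   // size[i] = number of nodes in tree rooted at i
    private int count;          // number of disjoint components

    public UnionFind(int n) {
        parent = new int[n];
        size = new int[n];
        count = n;
        for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
    }

    public int find(int p) {
        while (p != parent[p]) {
            parent[p] = parent[parent[p]]; // path compression by halving
            p = parent[p];
        }
        return p;
    }

    public boolean connected(int p, int q) { return find(p) == find(q); }

    public void union(int p, int q) {
        int rootP = find(p), rootQ = find(q);
        if (rootP == rootQ) return;
        // link the smaller tree under the larger to keep trees shallow
        if (size[rootP] < size[rootQ]) { parent[rootP] = rootQ; size[rootQ] += size[rootP]; }
        else                           { parent[rootQ] = rootP; size[rootP] += size[rootQ]; }
        count--;
    }

    public int count() { return count; }
}
```

Weighting plus compression keeps the trees nearly flat, which is why this version satisfies run-time requirements that the naive array-scanning approach can’t.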

Algorithms really get at the essence of what programming is. Anyone who works as a programmer has to put into practice algorithmic thinking to some degree, even if they aren’t aware of it.

I plan to continue writing about this as a way to keep me accountable for completing the two-course sequence.

A Major Update to refine_viaf

I’ve rewritten my refine_viaf project in Java. It’s what refine.codefork.com is now running. The old python code is deprecated and will no longer be maintained, but it will remain available in the python-deprecated branch on GitHub.

The only thing most users need to know is that refine_viaf should return better results now. For the curious, this post explains the subtle but important differences in the new version and some reasons for the rewrite.

Differences

In a nutshell, the main difference/improvement is that searches now behave more like the VIAF website.

This is due mainly to how sources (e.g. “LC” for the Library of Congress) are handled. Previously, either the source specified on the URL or the “preferred source” from the config file was used to filter out search results, but it did NOT get passed into the actual VIAF search query. This could give you some weird results. The new version works like VIAF’s website: if you don’t specify a source, everything gets searched; if you do specify one, it DOES get passed to the VIAF search query. Simple.

The old version had arcane rules for which name in each VIAF “cluster” result it actually displayed. In the new version, if you don’t specify a source, the most popular name (i.e. the name used by the most sources) in a search result is used for display. If you specify a source, that source’s name is always used.
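The display-name rule is simple enough to sketch. The types and names below are hypothetical illustrations, not refine_viaf’s actual code:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the display-name rule: if a source was specified,
// use that source's name; otherwise use the name backed by the most sources.
// The map shape and method names here are illustrative only.
public class NamePicker {
    /**
     * @param sourcesByName map from a candidate name in the cluster to the
     *                      source codes (e.g. "LC") that use that name
     * @param source        a source code, or null if none was specified
     */
    public static String pickDisplayName(Map<String, List<String>> sourcesByName,
                                         String source) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, List<String>> e : sourcesByName.entrySet()) {
            if (source != null) {
                if (e.getValue().contains(source)) return e.getKey();
            } else if (e.getValue().size() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue().size();
            }
        }
        return best; // null if a source was given but no name matched it
    }
}
```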

The old version supported a comma-separated list of sources at the end of the URL path. In the new version, only a single source is supported, since that’s what VIAF’s API accepts.

Lastly, the licenses are different: the python version was distributed under a BSD license. The new version is GNU GPL.

Other reasons for the rewrite

The changes above could have been implemented in python. I decided to rewrite it in Java for a few reasons:

– Overall performance is better in Java. The Django app used BeautifulSoup because VIAF’s XML used to be janky, but it appears this is no longer the case; Java’s SAX parser works great with their XML these days and is very fast. BeautifulSoup would leak memory and consume a lot of CPU, to the point where it would trigger automated warnings from my VPS provider. My server is very modest and needs to run other things, so these were real problems. Running the service as a single multi-threaded Java process keeps memory usage low and predictable, and it never spikes the CPU.

– Running a Java jar file is MUCH easier for people who want to run their own service, especially on Windows. With the python version, you had to install pip, install a bunch of packages, and create and configure a Django app, all of which put the software out of reach of many users who might want to run it.

– I don’t care what other people think: I like Java. Plus I wanted to experiment with Spring Boot. There are much leaner web frameworks I could have used to save some memory, but it was interesting to play with Spring.
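On the SAX point above: a streaming parser reacts to elements as they arrive instead of building the whole document in memory, which is why memory use stays flat no matter how large the response is. A minimal sketch (the element name “record” is a placeholder, not VIAF’s actual schema):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Stream the XML with Java's built-in SAX parser and count elements as
// callbacks fire; nothing is retained beyond the running count.
public class SaxSketch {
    public static int countRecords(byte[] xml) {
        final int[] count = {0};
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    if (qName.equals("record")) count[0]++; // placeholder element name
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return count[0];
    }
}
```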

Leave a comment!

If you use this thing, please take a second and leave a comment on this post. I’m interested to know how many people really run this on their own computers.

Enjoy.

Looking at Go

In the latest stage of my exploration/deepening of programming knowledge, I’ve been looking at Go.

There’s got to be something that piques my intellectual curiosity or solves a specific problem for me to want to learn a new language. Not much about the latest “hot” languages like Ruby, Scala, and Erlang appeals to me, so I haven’t bothered with them. In real world work, I like Python as a general purpose language, and I like Java (seriously!) for large projects that need the strong tooling and frameworks available for it. Lisp and Clojure have provided useful perspective and food for thought, but in practice, they haven’t found a place in the real world software I write. Everything else I tolerate only because I have to (I’m looking at you, Javascript).

Go is extremely intriguing. It strikes me as combining some of the best things about Python and Java. It would be great not to have to choose! I like the simple syntax (not as simple as Python’s, alas!), the static typing, the fact that it’s compiled, and the general philosophy of favoring composition over inheritance, an idea I’ve come to support more and more. In a world currently dominated by highly dynamic, interpreted languages with very loose type systems and a hierarchical object-oriented paradigm, Go really stands out. Following the trend of languages like Clojure, Go has concurrency features that take strong advantage of multicore computing, but its concurrency mechanisms seem much simpler. I’ve started to look at code samples and play with it a bit, and I really like what I see so far.

There’s actually a lot of negative discussion of Go on the web, but most of it concerns the language in its messy pre-1.0 state. The 1.0 release in March has supposedly tightened up a lot of things, and of course performance will only get better now that the fundamental semantics and features are solidly in place. This is an exciting time for what feels like the next evolutionary step in programming languages.

Composition and Inheritance

For as long as I can remember, writing any kind of non-trivial software meant you needed to use object-oriented programming. It was a no-brainer. So I learned all the fundamentals of OOP, and design patterns as well, since one couldn’t get very far in Java without knowing the most common patterns.

I think taking OOP for granted as the only natural way to manage complexity is why learning Lisp is so mind-blowing for programmers like myself. Take, for example, polymorphism. I didn’t know that there was anything besides subtype polymorphism–and I didn’t know it was called that; I knew it only as polymorphism, plain and simple. The ability of Lisp to do multiple dispatch, via CLOS generic functions, was incredibly eye-opening.

I think this is the sort of thing people mean when they go on about how Lisp has broadened their horizons and deepened their understanding of concepts.

To mention another example, in a bit more detail: as an experiment in some recent python code for work, I’ve been using fewer classes/objects and more functions. (It’s debatable just how much actual “functional programming” mileage you can get out of Python, but I’ll put that aside for now.) In such an approach, you inevitably end up using a lot of composition, rather than object inheritance, to build higher level abstractions. And that’s been working out very well so far.

Composition is a powerful thing because you can control the granularity of code reuse. With a carefully constructed library of functions, you can choose to call the functions at the appropriate level of abstraction you need, and even mix and match. That’s much harder to do with object inheritance, where classes force you into an all-or-nothing package deal–if you want only some of the functionality, you need to instantiate the whole object anyway. And if you want to selectively override functionality, you need to subclass, which effectively ties the parent class to its subtree, making it harder to modify in the future.
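The granularity argument can be shown in miniature. The post above is about Python, but the same idea reads clearly in Java’s `java.util.function` terms, since composed functions make the mix-and-match point explicit (the pipeline here is an invented example):

```java
import java.util.function.Function;

// Build behavior by composing small functions instead of subclassing:
// each step is reusable on its own, and callers pick the granularity.
public class Compose {
    static Function<String, String> trim = String::trim;
    static Function<String, String> lower = String::toLowerCase;
    static Function<String, String> collapse = s -> s.replaceAll("\\s+", " ");

    // "Override" behavior by swapping a step; no subclass required.
    static Function<String, String> normalize = trim.andThen(lower).andThen(collapse);

    public static void main(String[] args) {
        System.out.println(normalize.apply("  Hello   WORLD  ")); // prints "hello world"
    }
}
```

Swapping out `lower` changes `normalize` without touching `trim` or `collapse`, which is exactly the selective reuse that subclassing makes awkward.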

I’ve been thinking lately that objects are useful mostly to facilitate data abstraction: there’s no reason not to group together related accessors and mutators. But when I consider more complex bundles of functionality, I think twice before creating a class hierarchy and see if I can do it using functional composition instead.

Bigger! Faster! Stronger! 3 GB in the 2.0 GHz MacBook

The official Apple specs say that the 2.0 GHz MacBook can take up to 2 GB of memory. There’s a bit of information on the web–like this forum posting, for example–that says you can go up to 3 GB. The system board can address slightly more than 3 GB, so the 2 GB limit is reportedly lower than what the hardware is capable of. By chance, I noticed a 2 GB chip for a reasonable $40 on my local craigslist, so I decided to see for myself whether the stories were true.

It seems they are! I’ve been running the computer for a little over a week now with two chips: a 1 GB module, which had been in there before, and the new 2 GB module. It’s been put through its paces: on a given work day, I run Eclipse, Firefox, Thunderbird, Colloquy (an IRC client), Adium (an instant messaging client), Skype (a VoIP client), a Java application server in development mode, a MySQL server, and emacs, all at once.

The swap size has still been high, typically around 500 MB, but there are no longer the delays I used to experience with 2 GB of memory when I had a lot of apps open and switched between them. So far, there have been no problems with stability.

Here’s a screenshot from System Profiler:

Note that Apple released two different models with the 2.0 GHz Core 2 Duo processor. You can find your machine’s model in System Profiler on the Hardware Overview screen. The 2 GB official limit applies to the “MacBook2,1.” The later model, the “MacBook3,1,” can take up to 4 GB.

With the additional memory, this machine will hopefully last me another two years.