Category Archives: python

The Myth of Artisanal Programming

Paul Chiusano, the author of the excellent Functional Programming in Scala from Manning (one of the few tech publishers I buy from; worth every penny), recently wrote a blog post titled, “The advantages of static typing, simply stated”.

Lately all I seem to do is rant to people about this exact topic. Paul’s post is way more succinct than anything I can write, so go over there and read it.

While he takes pains to give a balanced treatment of static vs dynamic type systems, it seems much more cut-and-dried to me. Dynamic languages are easier and faster for development when you’re getting started on a project, and that’s great if the project never gets very big. But they scale very poorly, for all the reasons he describes. Recently, I had the daunting task of reading nearly 10,000 lines of Perl code (pretty good Perl, in my opinion). It was hard to make sense of and figure out how to modify and extend, whereas the MUCH larger Java codebase (over 100k lines, if I recall) that I worked with years ago felt very manageable.

My own history as a programmer matches Paul’s very closely. I started with Java, which was annoying but not a bad language by any means. Then Python came along and seemed like a liberation from Java’s rigidity and verbosity. But Python, Ruby and others are showing their weaknesses, and it’s no mystery why people are turning to the newer generation of statically typed languages like Scala, Haskell, Go, etc.

People who haven’t been around as long don’t necessarily have this perspective.

In retrospect, it’s interesting to me how we programmers “got sold” on dynamic languages, from a cultural perspective. You might recall that a big selling point was using simple text editors rather than IDEs, and there was this sense that writing code this way made you closer to the software somehow. Java was corporate, while Python was hand-crafted. There was a vague implicit notion of “artisanal” programming in these circles.

The upshot, of course, is that every time you read a chunk of code or call a function or method, your brain has to do a lot of the work that a statically typed language would be able to enforce and verify for you. But in a dynamic language, you won’t know what happens until the code runs. In large measure, the quality of software hinges on how much you can tell, a priori, about code before it runs at all. In a dynamic world, anything can happen, and often does.

This is a nightmare, pure and simple. Much of the strong focus on writing automated tests is to basically make up for the lack of static typing.
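To make the point concrete, here is a tiny Python sketch (the function is invented for illustration): the unannotated version fails only when it runs, possibly far from the call site, while the annotated version can be rejected by a static checker such as mypy before the program ever executes.

```python
def total(prices):
    return sum(prices)

# Nothing stops a caller from passing the wrong shape of data;
# the failure surfaces only at runtime.
try:
    total("3.50")          # a string, not a list of numbers
except TypeError as e:
    print("blew up at runtime:", e)

# With type annotations, a checker like mypy flags the bad call
# before the code runs at all:
def total_checked(prices: list[float]) -> float:
    return sum(prices)

print(total_checked([1.0, 2.5]))
```

Without the annotations, the only way to catch the bad call ahead of time is to write a test that exercises it, which is exactly the extra work described above.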

True artisanship lies in design: namely, thinking hard about the data structures and code organization you’re committing to. It’s not about being able to take liberties that can result in things that make no sense to the machine and that can cause errors at runtime that could have been caught beforehand.

A Major Update to refine_viaf

I’ve rewritten my refine_viaf project in Java. It’s what refine.codefork.com is now running. The old Python code is deprecated and will no longer be maintained, but it remains available in the python-deprecated branch on GitHub.

The only thing most users need to know is that refine_viaf should return better results now. For the curious, this post explains the subtle but important differences in the new version and some reasons for the rewrite.

Differences

In a nutshell, the main difference/improvement is that searches now behave more like the VIAF website.

This is due mainly to how sources (e.g. “LC” for the Library of Congress) are handled. Previously, the source specified on the URL or the “preferred source” from the config file was used to filter search results, but it did NOT get passed into the actual VIAF search query. This could give you some weird results. The new version works like VIAF’s website: if you don’t specify a source, everything gets searched; if you do specify one, it DOES get passed into the VIAF search query. Simple.
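To illustrate the difference in Python terms, the fix amounts to building the source into the query itself instead of filtering results afterward. This is a sketch only: the CQL field names below are placeholders, not necessarily the ones VIAF’s search API actually uses.

```python
# Sketch of the new behavior: the source becomes part of the query
# sent to VIAF, so VIAF itself does the filtering, just like its
# own website. Field names here are illustrative placeholders.

def build_viaf_query(name, source=None):
    query = 'local.mainHeadingEl all "%s"' % name
    if source:
        query += ' and local.sources = "%s"' % source.lower()
    return query

print(build_viaf_query("Twain, Mark"))
print(build_viaf_query("Twain, Mark", source="LC"))
```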

The old version had weird rules for which name in each VIAF “cluster” result it actually displayed. In the new version, if you don’t specify a source, the most popular name (i.e. the name used by the most sources) in a search result is used for display. If you specify a source, that source’s name is always used.

The old version supported a comma-separated list of sources at the end of the URL path. In the new version, only a single source is supported, since that’s what VIAF’s API accepts.

Lastly, the licenses are different: the python version was distributed under a BSD license. The new version is GNU GPL.

Other reasons for the rewrite

The changes above could have been implemented in python. I decided to rewrite it in Java for a few reasons:

– Overall performance is better in Java. The Django app used BeautifulSoup because VIAF’s XML used to be janky, but it appears this is no longer the case; Java’s SAX parser works great with their XML these days and is very fast. BeautifulSoup would leak memory and consume a lot of CPU, to the point where it would trigger automated warnings from my VPS provider. My server is very modest and needs to run other things, so these were real problems. Running the service as a single multi-threaded Java process keeps memory usage low and predictable, and it never spikes the CPU.

– Running a Java jar file is MUCH easier for people who want to run their own service, especially on Windows. With the python version, you had to install pip, install a bunch of packages, and create and configure a Django app, all of which put the software out of reach of many users who might want to run it.

– I don’t care what other people think: I like Java. Plus I wanted to experiment with Spring Boot. There are much leaner web frameworks I could have used to save some memory, but it was interesting to play with Spring.

Leave a comment!

If you use this thing, please take a second and leave a comment on this post. I’m interested to know how many people really run this on their own computers.

Enjoy.

Managing Dependencies in Python vs Ruby

Ruby’s Bundler tool is amazing.

With Python projects, the standard way of doing things is to set up a virtualenv and use pip to install packages from PyPI, as specified in a requirements.txt file. This way, each project’s dependencies are kept separate, installed in their own isolated, sandboxed environment.

This works pretty well. But sometimes, when I am debugging a third party package, I want to be able to get the source code from git and use it instead of the package from PyPI, so I can make changes, troubleshoot, experiment, etc. This is a pain in the butt. You have to remove the installed package and either 1) install manually (and repeatedly, as you work) from your cloned repository, or 2) add the repository directory to your Python library path somehow. Then you have to undo these changes to go back to using the PyPI package. Either way, it’s clunky and annoying.
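For the record, option 2 above can be done by hand. Here’s a minimal, self-contained sketch (the package name and path are made up) that shadows an installed package by putting a local copy first on sys.path. The point is that you have to do this manually, and undo it manually, every time you switch back and forth.

```python
import os
import sys
import tempfile

# Stand in for a local clone of a third-party package (hypothetical).
clone = tempfile.mkdtemp()
with open(os.path.join(clone, "mypkg.py"), "w") as f:
    f.write("VERSION = 'local-dev'\n")

# Putting the clone first on sys.path makes it shadow any installed
# copy of the same package for this process only.
sys.path.insert(0, clone)
import mypkg

print(mypkg.VERSION)  # the local copy wins
```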

Ruby’s Bundler tool takes a very different approach to dependencies. It, too, downloads appropriate versions of gems (which is what packages are called), as listed in a Gemfile. But unlike pip, it can store multiple versions of a gem, and it even lets you specify that a gem lives in a GitHub or local repository; moreover, it makes the right packages available each time you run your program! That is, each time you use the “bundle exec” wrapper to run Rails or anything else, it sets up a custom set of directories for Ruby’s library path that point ONLY to the versions you want, ignoring the others.

I did this today when trying to pin down the source of some deprecation warnings I was seeing after some gem upgrades. My Gemfile had these lines in it:

gem 'sunspot_rails', '~> 2.1.0'
gem 'sunspot_solr', '~> 2.1.0'

I cloned the sunspot repository containing those gems. Then I ran:

# bundle config local.sunspot_rails ~/sunspot
# bundle config local.sunspot_solr ~/sunspot

And changed the Gemfile lines:

gem 'sunspot_rails', :github => 'sunspot/sunspot_rails', :branch => 'master'
gem 'sunspot_solr', :github => 'sunspot/sunspot_solr', :branch => 'master'

Finally, I ran “bundler update”. That’s it! I could make changes to my cloned repository, restart Rails, and see the changes immediately.

When I was done messing around, I changed my Gemfile back, ran “bundler update” again, and I was back to using my original gems.

Being able to work so easily with third party code allowed me to quickly figure out where the deprecated calls were being made and file an issue with the sunspot project.

A VIAF Reconciliation Service for OpenRefine


OpenRefine is a wonderful tool my coworkers have been using to clean data for my project at work. Our workflow has been nice and simple: they take a CSV dump from a database, transform the data in OpenRefine, and export it as CSV. I write scripts to detect the changes and update the database with the new data.

We have a need, in the next few months, to reconcile the names of various individuals and organizations with standard “universal” identifiers for them in the Virtual International Authority File. The tricky part is that any given name in our system might have several candidates in VIAF, so it can’t be a fully automated process. A human being needs to look at them and make a decision. OpenRefine allows you to do this reconciliation, and also provides an interface that lets you choose among candidates.
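For context, OpenRefine talks to a reconciliation service through a small JSON protocol. Here is a rough sketch of the kind of response a service sends back for a single cell; the ids, names, and scores are invented. Because several candidates come back and none is marked as a definite match, OpenRefine shows them in its interface for a human to choose.

```python
import json

# Illustrative response for one reconciliation query ("q0").
# All ids and scores here are made up for the example.
response = {
    "q0": {
        "result": [
            {"id": "12345678", "name": "Twain, Mark, 1835-1910",
             "type": [{"id": "/people/person", "name": "Person"}],
             "score": 0.9, "match": False},
            {"id": "87654321", "name": "Twain, Mark (pseudonym)",
             "type": [{"id": "/people/person", "name": "Person"}],
             "score": 0.3, "match": False},
        ]
    }
}

print(json.dumps(response, indent=2))
```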

Communicating with VIAF is not built in, though. Roderic D. M. Page wrote a VIAF reconciliation service, and it’s publicly accessible at the address listed on the linked page (the PHP source code is available here). It works very nicely.

I wanted to write my own version for two reasons: first, I needed it to support the different name types in VIAF; second, I wanted to host it myself, in case I needed to make large numbers of queries, so as not to be an obnoxious burden on Page’s server.

The project is called refine_viaf and the source code is available at https://github.com/codeforkjeff/refine_viaf.

For those who just want to use it without hosting their own installation, I’ve also made the service publicly accessible at http://refine.codefork.com, where there are instructions on how to configure OpenRefine to use it.

Goodbye, Sublime Text?

When one of my coworkers started using Sublime Text about a year ago, I was intrigued. I played with it and found it to be a very featureful and speedy editor. I wasn’t compelled enough to make the switch from Emacs, though. (You’ll pry it from my cold dead hands!) But I really liked the fact that you could write plugins for it in Python.

So for fun, I gradually ported my Emacs library, which integrates with a bunch of custom development tools at work, to Sublime Text. It works very well, and the ST users in the office have been happy with it. Although I don’t actually use ST regularly, I’ve since been following news about its development.

What I discovered is that many of its users are unhappy with the price tag and dissatisfied with the support they received via the forums. So much so, in fact, that there’s now an attempt to create an open source clone by reverse engineering it. The project is named lime.

I learned about this with very mixed feelings. There’s a good chance the project will take off, given how much frustration exists with ST. Of course, the trend is nothing new: open source software has been supplanting closed source commercial software for a long time now. But this isn’t Microsoft or Oracle we’re talking about; it’s a very small company, charging what I think is a reasonable amount of money for their product. While they undoubtedly could do more to make their users happier, they probably can’t do so without hurting what I imagine are pretty slim profit margins. That, or never sleeping again.

It’s not news that making a software product is much less viable than it used to be. Where money is made, it’s increasingly through consulting and customization, but one wonders about the size of that market.

It’s generally a good thing that open source has “socialized” software development: technology has enabled communities of programmers to contribute and collaborate on a large scale, in a highly distributed fashion, to create good quality software available to all, taking it out of the profit equation. The problem is that the rest of the economy hasn’t caught up with this new kind of economics.

I don’t mean to sound dramatic: there are many jobs out there for programmers, of course. But it saddens me that if you want to create a product to sell these days, it’s simply not enough to have a good idea anymore. It has to be dirt cheap or free, you have to respond to every message immediately, and you have to honor every single feature request. Between the open source world and the big software companies that service corporate customers, there is a vast middle ground of small companies that is quickly vanishing.

Looking at Go

In the latest stage of my exploration/deepening of programming knowledge, I’ve been looking at Go.

There’s got to be something that piques my intellectual curiosity or solves a specific problem for me to want to learn a new language. Not much about the latest “hot” languages like Ruby, Scala, and Erlang appeals to me, so I haven’t bothered with them. In real world work, I like Python as a general purpose language, and I like Java (seriously!) for large projects that need the strong tooling and frameworks available for it. Lisp and Clojure have provided useful perspective and food for thought, but in practice, they haven’t found a place in the real world software I write. Everything else I tolerate only because I have to (I’m looking at you, Javascript).

Go is extremely intriguing. It strikes me as combining some of the best things about Python and Java. It would be great not to have to choose! I like the simple syntax (not as simple as Python’s, alas!), the static typing, the fact that it’s compiled, and the general philosophy of favoring composition over inheritance, an idea I’ve come to support more and more. In a world currently dominated by highly dynamic, interpreted languages with very loose typing and a hierarchical object-oriented paradigm, Go really stands out. Following the trend of languages like Clojure, Go has concurrency features that take strong advantage of multicore computing, except that its concurrency mechanisms seem much simpler. I’ve started to look at code samples and play with it a bit, and I really like what I see so far.

There’s actually a lot of negative discussion of Go on the web, but most of it is about the language in its messy pre-1.0 state. The March 2012 1.0 release supposedly tightened up a lot of things, and of course performance will only get better now that the fundamental semantics and features are solidly in place. This is an exciting time for what feels like the next evolutionary step in programming languages.

Monitoring DSL diagnostics on a ZyXEL P-600 Series modem

The DSL at the house has been really flaky the past three days. The line seems to drop periodically, and I also noticed a lot of static on the voice line.

I suspected a line problem, so I called Earthlink support last night to try to get it straightened out. Surprisingly, the line tested good up to the point where it entered the house. But the diagnostics on the DSL modem were showing a low noise margin (signal-to-noise ratio) and high attenuation (signal loss on the line), so there was definitely a problem somewhere. The trouble had to be inside the house: the DSL modem or some wiring had gone bad, or both.

I tried changing out some cables and tweaking the way the phone and fax (my housemate runs her own business) were all hooked up into the line. That seems to have fixed the problem. The line’s been steady for almost 24 hours now. I’ve been watching the diagnostics and researching what the numbers mean, and they seem to be within acceptable-to-good ranges. So far so good.

It was annoying to have to load the web interface on the ZyXEL P-600 Series modem (it’s a P-660R-ELNK) and continually refresh the page to watch the diagnostics. Plus I had to remember if any of the numbers changed and by how much. So I whipped up a little python script to fetch the data from the web interface and do some logging.

There were some interesting peculiarities in the ZyXEL web interface. After authenticating with a password, only that computer’s IP can use the interface, apparently; requests from other hosts get locked out until after a few minutes of inactivity. Also, the way the interface refreshed the diagnostics page was a bit odd: it used a form submission to set some state that would cause another page to update its contents.

The script output looks like this:

Sun Jan 17 19:15:37 2010 noise = 16 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:16:37 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:17:38 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:18:38 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:19:38 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:20:39 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:21:39 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:22:39 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:23:39 2010 noise = 18 (good) outputPower = 11 attenuation = 31 (very good)
Sun Jan 17 19:24:39 2010 noise = 17 (good) outputPower = 11 attenuation = 31 (very good)

A copy of the script is available here. You may need to do some tweaking to get it to work with your setup. It works with python 2.5 and 2.6.
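The parsing-and-rating half of such a script might look roughly like this. To be clear, this is a sketch rather than the actual script: the sample HTML, the regexes, and the threshold ranges are all illustrative and would need adjusting for the real modem page and your particular line.

```python
import re
import time

# Illustrative rating thresholds; acceptable ranges vary by line.
def rate_noise_margin(db):
    # Higher signal-to-noise margin is better.
    if db >= 20: return "very good"
    if db >= 11: return "good"
    if db >= 7:  return "ok"
    return "bad"

def rate_attenuation(db):
    # Lower attenuation (line loss) is better.
    if db <= 20: return "excellent"
    if db <= 40: return "very good"
    if db <= 50: return "ok"
    return "bad"

def parse_diagnostics(html):
    # Hypothetical field labels; adjust to your modem's actual page.
    def grab(label):
        return int(re.search(label + r"\D+(\d+)", html).group(1))
    return grab("noise margin"), grab("output power"), grab("attenuation")

# Stand-in for the fetched diagnostics page.
html = "noise margin: 17 db output power: 11 db attenuation: 31 db"
noise, power, atten = parse_diagnostics(html)
print("%s noise = %d (%s) outputPower = %d attenuation = %d (%s)" % (
    time.asctime(), noise, rate_noise_margin(noise),
    power, atten, rate_attenuation(atten)))
```

Run in a loop with a sleep between iterations, appending each line to a file, and you get a log like the one shown above.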

FixedGearGallery Index 2.0

I created a new interface for my FixedGearGallery Index. What better way to procrastinate than spending a few hours on code?

The original purpose of the index was to provide an easy way to browse through the relevant pages of a particular make/model on FGG. My first version accomplished that goal, but it’s awfully clunky. After using it a while, I discovered how annoying it was to toggle between windows and keep track of where I was in the list.

The new version places navigation controls in a small area at the top of the page. It loads content from FGG into an iframe, eliminating the need for switching among windows. And the previous/next links allow you to browse sequentially, making it much easier to keep track of what you’ve already seen.

It’s not perfect but it’s definitely an improvement. I have fancier ideas for organizing FGG content but I don’t want to go too far by pirating Dennis’ site. I’m grateful he gave me permission to do the index at all when I emailed him about it a few months ago.

The Lifespan of Software

Rumors of Chandler’s Death Are Greatly Exaggerated. So says the renowned Phillip J. Eby.

In light of all the damning media scrutiny paid to Chandler in recent years, Phillip makes an excellent point: the project funded work on a bunch of important open source Python libraries. I didn’t realize this, and it drastically changed my regard for the OSAF’s work. If this aspect of the project got mentioned more, I think Chandler would get a lot more respect. Even if Chandler 1.0 never sees the light of day, it’s already made major contributions to the Python community.

Proprietary software has a definite lifespan: once a company has stopped developing and supporting it, that’s the end. For the company, value is localized and non-transferable in the closed source code base. The business model of selling software depends on this. Once the company kills off the product, the value more or less disappears. You can still use it, of course, but it will decrease in value as similar, hopefully better products appear on the market.

The value of open source software, on the other hand, isn’t limited to its immediate use. Even if an application is no longer actively used and maintained, the code can spark ideas, be used to fork a new project, serve as a lesson in design, etc. Its value can be perpetually renewed by virtue of the fact that it circulates in different ways. If it’s large enough, like Chandler or Zope, it can spawn mini-projects, components, and libraries for reuse.

Years ago, just for fun, I wrote a Java version of a Napster server. It was called jnerve, and I released the code as open source. I tried to get people to host it and use it, but opennap, the C implementation, was naturally faster, more efficient, and more mature. jnerve seemed like a dead end, so I stopped working on it. There were some cool architectural bits that were interesting to write, but I regarded the project as a failure.

Months later at a conference, I got a demo CD of some new peer-to-peer file sharing software. (“P2P” was all the rage then.) When I ran it, I was astounded to see a copyright message with my name on it. They had used my code as the basis for their commercial product! The code was able to live on in a different form. I’m not sure it was actually legal, given that jnerve was GPL, but I didn’t care enough to pursue the matter.

Maintainability Pitfalls in PHP

Tim Bray makes this prediction about PHP for 2008:

PHP will remain popular but its growth will slow, as people get nervous about its maintainability and security stories.

I share Tim’s love/hate relationship with PHP. It’s definitely a powerful and easy language. But,

… speaking as an actual computer programmer, I really dislike PHP. I find it ugly and un-modular and there’s something about it that encourages people to write horrible code. We’re talking serious maintainability pain.

I’m seeing this right now in some code I’ve recently taken over. The previous programmer was quite skilled and did a great job, but it’s clear there were some areas he had to write quickly and hack together. The flip side of PHP’s ease of use is that sloppiness accumulates very quickly when you’re doing things in a hurry. To some extent, that’s an unavoidable aspect of a growing codebase. But there are also specific things about PHP itself that foster disorganization and unmaintainability:

* The lack of namespaces. This makes it hard to quickly locate a function or class definition. Classes can be used as namespaces, but that’s a hack, and leads to ugly un-OOPish uses of classes. PHP could really benefit from packages or modules.

* While PHP5 has vastly improved its object functionality, it often feels like the developer culture remains mired in a function-oriented paradigm. PHP’s relative ease of use and wide availability on commodity webhosting has produced a huge pool of developers whose skills are pretty wide-ranging. The low end of that tends towards hacky, function-oriented code that simply “gets the job done.” I’d like to see more thoughtful discussion on PHP sites and forums about object design and philosophy, about when to use functions and classes, and about how to mix them up harmoniously.

* Having a library of thousands of built-in functions in a global namespace with little rhyme or reason to their naming doesn’t exactly provide a great model of maintainability.

* extract() should die. Die, die, die.

* There’s not much agreement about OOP performance: some insist that heavy use of certain OOP features slows PHP down a lot, so you should avoid them whenever possible. That advice is not only plain dumb but also leads to deliberately confusing, half-assed uses of OOP in the name of better performance.

Maintainability is a matter of discipline, since you can write sloppy code in any language. That aside, PHP does make it extra hard to keep things orderly. I think CakePHP is a step in the right direction, though if you’re going to use a strict MVC architecture, you might as well dump PHP and just go with Ruby on Rails or Python.