The Lifespan of Software

Rumors of Chandler’s Death Are Greatly Exaggerated. So says the renowned Phillip J. Eby.

In light of all the damning media scrutiny paid to Chandler in recent years, Phillip makes an excellent point: the project funded work on a bunch of important open source Python libraries. I didn’t realize this, and it drastically changed my regard for the OSAF’s work. If this aspect of the project were mentioned more often, I think Chandler would get a lot more respect. Even if Chandler 1.0 never sees the light of day, it has already made major contributions to the Python community.

Proprietary software has a definite lifespan: once the company stops developing and supporting it, that’s the end. For the company, value is localized and non-transferable, locked up in the closed-source code base; the business model of selling software depends on this. Once the company kills off the product, the value more or less disappears. You can still use it, of course, but it will decrease in value as similar, hopefully better, products appear on the market.

The value of open source software, on the other hand, isn’t limited to its immediate use. Even if an application is no longer actively used and maintained, the code can spark ideas, be used to fork a new project, serve as a lesson in design, etc. Its value can be perpetually renewed by virtue of the fact that it circulates in different ways. If it’s large enough, like Chandler or Zope, it can spawn mini-projects, components, and libraries for reuse.

Years ago, just for fun, I wrote a Java implementation of a Napster server. It was called jnerve, and I released the code as open source. I tried to get people to host it and use it, but opennap, the C implementation, was naturally faster, more efficient, and more mature. jnerve seemed like a dead end, so I stopped working on it. It had some cool architectural bits that were interesting to write, but I regarded the project as a failure.

Months later at a conference, I got a demo CD of some new peer-to-peer file sharing software. (“P2P” was all the rage then.) When I ran it, I was astounded to see a copyright message with my name on it. They had used my code as the basis for their commercial product! The code was able to live on in a different form. I’m not sure it was actually legal, given that jnerve was GPL, but I didn’t care enough to pursue the matter.

The Inexact Science of Optimizing MySQL

The past week or so, I’ve been playing database administrator: monitoring load, examining the log for slow queries, tweaking parameters, rewriting queries, adding indexes, and repeating the cycle over and over again. It’s a tedious, time-consuming, and exhausting process.

Tuning is tricky because inefficiencies aren’t always immediately apparent, and there may be uncontrollable factors affecting performance in VPS environments. Here are some things I’ve learned or had to refresh myself about with respect to tuning MySQL.

Don’t assume anything about why queries are slow. Under high load, a lot of queries can show up in the slow query log simply because the database is working hard; it doesn’t necessarily mean every one of them needs optimization. Look for patterns, and expect to spend time reviewing queries that are already optimized.
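
For what it’s worth, here’s a minimal way to poke at this from the MySQL console (on the versions I’ve used, the slow log itself is enabled with the log-slow-queries option in my.cnf):

```sql
-- Lower the slow-query threshold (in seconds) so more candidates
-- show up in the log while you watch.
SET GLOBAL long_query_time = 2;

-- See what is running right now under load; a pile of threads stuck
-- in "Copying to tmp table" or "Sorting result" is a better clue
-- than any single slow-log entry.
SHOW FULL PROCESSLIST;
```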

Look for strange queries. Don’t recognize one? Well, maybe it shouldn’t be there. I found an expensive query for a website feature that became obsolete months ago; it was still being executed even though the data was no longer used or displayed anywhere.

EXPLAIN is your friend. Interpreting the output makes my head pound sometimes, but it’s absolutely necessary. Invest the time and effort to understand it.
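
A sketch with an invented table, just to show the shape of it:

```sql
-- Ask MySQL how it plans to execute the query.
EXPLAIN
SELECT id, title
FROM posts
WHERE author_id = 42
ORDER BY created_at DESC
LIMIT 20;

-- Columns worth staring at in the output:
--   type:  ALL means a full table scan; ref/range mean an index is used
--   key:   which index was actually chosen, if any
--   rows:  the optimizer's estimate of rows examined; big numbers are a red flag
--   Extra: "Using filesort" (plus "Using temporary" on joins and GROUP BY)
--          flags the expensive operations discussed below
```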

Multi-column indexes are a necessary evil. Most tables I was examining had only single-column indexes, causing complex WHERE clauses and/or ORDER BY to do costly “Using temporary; Using filesort” operations. Adding a few multi-column indexes helped a great deal. In general, I dislike multi-column indexes because they require maintenance: if you modify a complex query, you might have to add or change the index, or it will become slow as molasses. But unfortunately, that’s the tradeoff for performance.
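
A hypothetical example of the kind of fix that helped (same invented posts table as above):

```sql
-- With only single-column indexes on author_id and created_at, the
-- query above filters on one index and then does a filesort for the
-- ORDER BY. A composite index covers both at once, so MySQL can read
-- rows in sorted order straight out of the index.
ALTER TABLE posts ADD INDEX idx_author_created (author_id, created_at);

EXPLAIN
SELECT id, title
FROM posts
WHERE author_id = 42
ORDER BY created_at DESC
LIMIT 20;
-- Extra should now read "Using where" instead of "Using filesort".
```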

The “same” query can behave differently with different datasets. A query can sometimes use an index, and sometimes use a filesort, depending, for example, on constants in the WHERE clause. Be sure to use EXPLAIN with different values.
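
For instance:

```sql
-- If author_id = 42 matches a handful of rows, the optimizer uses the
-- index; if another value matches half the table, it may decide a full
-- scan plus filesort is cheaper. Same query shape, different plan.
EXPLAIN SELECT id, title FROM posts WHERE author_id = 42 ORDER BY created_at;
EXPLAIN SELECT id, title FROM posts WHERE author_id = 7 ORDER BY created_at;
```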

Give the query optimizer hints. Disk access is slow, especially on a VPS, so I wanted to avoid filesorts at all costs. Using STRAIGHT_JOIN forced MySQL to process an indexed table first; otherwise, it believed that a filesort would be faster in some cases. It might be, sometimes, but if you think you know better, use the hint.
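
A sketch with invented tables; STRAIGHT_JOIN forces the join order to match the order the tables are written in:

```sql
-- With a plain JOIN, the optimizer picks the table order itself and
-- may start from the side that ends in a filesort. STRAIGHT_JOIN makes
-- it process posts first, so an index on posts.created_at can satisfy
-- both the range condition and the sort.
SELECT STRAIGHT_JOIN p.id, p.title, u.name
FROM posts p
JOIN users u ON u.id = p.author_id
WHERE p.created_at > '2008-01-01'
ORDER BY p.created_at;
```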

Disk access on a VPS can be wonky. It depends a lot on what else is happening on the hardware node. This relates somewhat to the previous point: all other things being equal, MySQL could be right about a filesort being faster than an index. But it’s usually not the case in an environment where disk I/O is very costly.

Reduce the number of joins wherever possible. It doesn’t seem like a few indexed joins into “lookup” tables (basically key-value types of things) would affect performance that much, but amazingly, it can. Really. Break up queries into smaller, simpler ones if at all possible.
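
A sketch of the kind of split I mean, with an invented schema; in practice the IN list in the last query would be built from the first result set in application code:

```sql
-- One query with a join into a lookup table:
SELECT p.id, p.title, c.name AS category
FROM posts p
JOIN categories c ON c.id = p.category_id
WHERE p.created_at > '2008-01-01';

-- Split into two trivial queries: fetch the posts, then resolve the
-- handful of distinct category ids, stitching them together in code.
SELECT id, title, category_id FROM posts WHERE created_at > '2008-01-01';
SELECT id, name FROM categories WHERE id IN (3, 7, 19);
```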

Run ANALYZE TABLE and OPTIMIZE TABLE. The manual says you don’t typically need to run these commands, but I found that frequently updated tables benefit from this maintenance, even if they have only a few thousand records.
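
Both are one-liners:

```sql
-- ANALYZE TABLE refreshes the key-distribution statistics the
-- optimizer uses to choose plans; stale statistics can push it
-- toward bad ones.
ANALYZE TABLE posts;

-- OPTIMIZE TABLE defragments the data file and sorts the indexes,
-- which helps after heavy UPDATE/DELETE churn. It locks the table
-- while it runs, so schedule it off-peak.
OPTIMIZE TABLE posts;
```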

Work iteratively. Sometimes performance problems are localized to one or two spots, but more often, inefficiencies are spread out over a lot of queries; they may become noticeable only after several passes, as server conditions fluctuate. Discovering these problem areas takes a lot of patience.

Caching is a Workaround, not a Solution

Like every website that deals with traffic spikes, the one I’m working on these days does a lot of caching. This past week I’ve spent a lot of time reviewing the caching code as well as tuning the database, to get the site working efficiently on a newly upgraded virtual private server.

The following occurred to me: as wonderful and necessary as caching is, it’s fundamentally a workaround. The core problem is insufficient resources. Given enough CPU and memory, you would never need to cache. It’s when those resources are insufficient for a particular traffic load that caching becomes immensely helpful. That’s why it’s a workaround: it addresses the problem in practice, but it doesn’t really solve it. And it’s not a perfect workaround either: simple caching mechanisms usually introduce a lag in the freshness of the content.

Why does this matter? Because caching shouldn’t substitute for efficient code. That is, uncached operations should still make the best possible use of resources. Otherwise, caching becomes a crutch, luring you into a false sense of security about how well the guts of the application really perform. Ideally, caching should be the last layer added, on top of already well-abstracted code.

Maintainability Pitfalls in PHP

Tim Bray makes this prediction about PHP for 2008:

PHP will remain popular but its growth will slow, as people get nervous about its maintainability and security stories.

I share Tim’s love/hate relationship with PHP. It’s definitely a powerful and easy language. But,

… speaking as an actual computer programmer, I really dislike PHP. I find it ugly and un-modular and there’s something about it that encourages people to write horrible code. We’re talking serious maintainability pain.

I’m seeing this right now in some code I’ve recently taken over. The previous programmer was quite skilled and did a great job, but it’s clear there were areas he had to write quickly and hack together. The flip side of PHP’s ease of use is that sloppiness accumulates very quickly when you’re doing things in a hurry. To some extent, that’s an unavoidable aspect of a growing codebase. But there are also specific things about PHP itself that foster disorganization and unmaintainability:

* The lack of namespaces. This makes it hard to quickly locate a function or class definition. Classes can be used as namespaces, but that’s a hack, and leads to ugly un-OOPish uses of classes. PHP could really benefit from packages or modules.

* While PHP5 has vastly improved its object functionality, it often feels like the developer culture remains mired in a function-oriented paradigm. PHP’s relative ease of use and wide availability on commodity webhosting has produced a huge pool of developers whose skills are pretty wide-ranging. The low end of that tends towards hacky, function-oriented code that simply “gets the job done.” I’d like to see more thoughtful discussion on PHP sites and forums about object design and philosophy, about when to use functions and classes, and about how to mix them up harmoniously.

* Having a library of thousands of built-in functions in a global namespace with little rhyme or reason to their naming doesn’t exactly provide a great model of maintainability.

* extract() should die. Die, die, die.

* There’s not much agreement about OOP performance: some insist that heavy usage of certain OOP features slows PHP down so much that you should avoid them whenever possible. That advice is not only plain dumb, it leads to deliberately confusing and half-assed uses of OOP in the name of better performance.

Maintainability is a matter of discipline, since you can write sloppy code in any language. That aside, PHP does make it extra hard to keep things orderly. I think CakePHP is a step in the right direction, though if you’re going to use a strict MVC architecture, you might as well dump PHP and just go with Ruby on Rails or Python.