Two Styles of Caching (PHP’s Cache_Lite vs memcached)

Since the recent slashdotting of our website (we held up okay, but there’s always room for improvement), I’ve been investigating the possibility of moving from Cache_Lite (actually, Cache_Lite_Function) to memcached in our PHP code.

Much of the discussion comparing these solutions focuses on raw benchmark performance. In the real world, though, conditions outside the benchmark are rarely equal. On a VPS, disk I/O times are notoriously variable, which makes memcached all the more attractive. Memory is faster than disk in almost every environment, yes, but avoiding disk access also conserves a scarce resource, so fewer processes end up blocking on I/O.

A public mailing list post by one Brian Moon makes exactly this point:

If you rolled your own caching system on the local filesystem, benchmarks would show that it is faster. However, what you do not see in benchmarks is what happens to your FS under load. Your kernel has to use a lot of resources to do all that file IO. […]

So, enter memcached. It scales much better than a file based cache. Sure, its slower. I have even seen some tests where its slower than the database. But, tests are not the real world. In the real world, memcached does a great job.

Okay, great. memcached is better when you take overall resource usage into account. But there’s a very useful Cache_Lite_Function feature that memcached doesn’t seem to have.

When you initialize a Cache_Lite_Function object, you set a “lifeTime” parameter, then wrap your regular function calls with the call() method. If there is no cached result younger than that lifetime, the underlying function actually runs and its result is stored in the cache with a fresh timestamp; otherwise the cached result is returned.

The cool thing is that you can create several cache objects, each with a different lifetime, all pointing to the same directory store without a problem. Pages can then raise or lower the effective cache lifetime dynamically as load changes, so you can serve slightly older data from cache when necessary, keeping the site responsive while saving database queries. On a site where content changes relatively infrequently, this is a great feature to have: serve it fresh when load is low, serve from cache when load is high.
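A rough sketch of that pattern is below; get_headlines(), fetch_headlines_from_db(), the cache directory, and the load threshold are all made up for illustration.

```php
<?php
// Rough sketch only: get_headlines(), fetch_headlines_from_db(), the cache
// directory, and the load threshold are hypothetical.
require_once 'Cache/Lite/Function.php';

function get_headlines($limit)
{
    return fetch_headlines_from_db($limit);   // expensive database work
}

// Two cache objects sharing the same directory store, with different lifetimes.
$shortCache = new Cache_Lite_Function(array(
    'cacheDir' => '/tmp/cache/',
    'lifeTime' => 60     // serve fairly fresh data when load is low
));
$longCache = new Cache_Lite_Function(array(
    'cacheDir' => '/tmp/cache/',
    'lifeTime' => 600    // tolerate older data when load is high
));

// Choose a lifetime based on current load, then wrap the call.
$load  = sys_getloadavg();
$cache = ($load[0] > 3.0) ? $longCache : $shortCache;
$headlines = $cache->call('get_headlines', 10);
?>
```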

memcached, on the other hand, requires you to specify an expiration time when you store data in the cache. A retrieval call doesn’t let you specify a time period, so you can’t do the above. If data has expired, it’s expired.
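With the PHP Memcache extension, for example, the expiration is fixed when the value is stored (the key name below is arbitrary):

```php
<?php
// Expiration is decided at set() time; get() can only take it or leave it.
$memcache = new Memcache();
$memcache->connect('localhost', 11211);

$memcache->set('headlines', $headlines, 0, 60);   // gone after 60 seconds
$headlines = $memcache->get('headlines');         // false once it has expired
?>
```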

It’d be interesting to hack Cache_Lite_Function to use memcached as its store, so you could get the best of both worlds. It would mean storing items in memcached with no expiration, tacking a timestamp onto the data, and doing the freshness check manually, but it might work.
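A minimal sketch of that idea; the class and helper names here are my own invention, not part of either library:

```php
<?php
// Store values in memcached with no expiration, but keep a timestamp
// alongside each value and check its age against a caller-supplied lifetime.
class MemcacheLifetimeStore
{
    private $memcache;

    public function __construct($host = 'localhost', $port = 11211)
    {
        $this->memcache = new Memcache();
        $this->memcache->connect($host, $port);
    }

    public function call($lifeTime, $function /* , $args... */)
    {
        $args = array_slice(func_get_args(), 2);
        $key  = md5($function . serialize($args));

        $entry = $this->memcache->get($key);
        if ($entry !== false && (time() - $entry['time']) <= $lifeTime) {
            return $entry['value'];   // still fresh enough for this caller
        }

        // Stale or missing: make the real call, store it with no expiration.
        $value = call_user_func_array($function, $args);
        $this->memcache->set($key, array('time' => time(), 'value' => $value), 0, 0);
        return $value;
    }
}

// Usage: different pages can pass different lifetimes for the same cached data.
// $store = new MemcacheLifetimeStore();
// $headlines = $store->call(60,  'get_headlines', 10);   // low-load page
// $headlines = $store->call(600, 'get_headlines', 10);   // high-load page
?>
```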

There’s no such thing as a content management system

During a meeting at work today, someone remarked, “No one I know seems happy with their content management system.”

Somehow, that’s unsurprising. The problem, I think, is that there’s really no such thing as a content management system. Think about how absurd that term is. It’s a system (it’s organized and has structure) that manages (performs operations on) content (er, stuff). Well then… what piece of software isn’t a CMS?!

When people talk about a CMS, they really mean publishing software. The website I maintain was written specifically for managing news articles. It does its job reasonably well, despite needing some cleanup and refactoring. What’s devious about the term “CMS” is that people start to expect all sorts of things from it. After all, it manages content, right? So why can’t it easily integrate with other sites, offer social networking features, do fancy AJAX tricks, and make dinner, with CPU cycles to spare?

The fact is, no software can do it all. There’s sometimes the wishful thinking that if we were using a pre-packaged CMS instead of a custom solution, we’d be better off. That’s just not true. A pre-packaged CMS can be a good option for simple needs, but customizing one is often such a headache that you’d have been better off writing something custom-tailored in the first place. The most flexible (and therefore “best”) pre-packaged CMSes are often not ready-to-run software but well-designed frameworks (like Zope) that require you to write code for the specific content you want to handle.

So why is no one happy with what they have? I suspect it’s because they didn’t give enough thought to what they wanted, or their expectations were too high, or both.

There’s nothing magical about a CMS. It follows the same rules as any other kind of software: the requirements for what it does should be clear, and the proper code abstractions should be in place. Like any other project, it should support a set of features but also be able to change and grow easily. You can only achieve those goals with proper planning and good code design, not with confusing lingo like “content management system.”

The Lifespan of Software

Rumors of Chandler’s Death Are Greatly Exaggerated. So says the renowned Phillip J. Eby.

In light of all the damning media scrutiny paid to Chandler in recent years, Phillip makes an excellent point: the project funded work on a number of important open source Python libraries. I didn’t realize this, and it significantly changed my regard for the OSAF’s work. If this aspect of the project were mentioned more often, I think Chandler would get a lot more respect. Even if Chandler 1.0 never sees the light of day, it has already made major contributions to the Python community.

Proprietary software has a definite lifespan: once a company stops developing and supporting it, that’s the end. For the company, the value is locked up in the closed source code base and isn’t transferable; the business model of selling software depends on this. Once the company kills off the product, the value more or less disappears. You can still use it, of course, but it becomes worth less and less as similar, hopefully better products appear on the market.

The value of open source software, on the other hand, isn’t limited to its immediate use. Even if an application is no longer actively used and maintained, the code can spark ideas, be used to fork a new project, serve as a lesson in design, etc. Its value can be perpetually renewed by virtue of the fact that it circulates in different ways. If it’s large enough, like Chandler or Zope, it can spawn mini-projects, components, and libraries for reuse.

Years ago, I wrote a Java version of a Napster server, just for fun. It was called jnerve, and I released the code as open source. I tried to get people to host it and use it, but opennap, the C implementation, was naturally faster, more efficient, and more mature. jnerve seemed like a dead end, so I stopped working on it. There were some cool architectural bits that were interesting to write, but I regarded the project as a failure.

Months later at a conference, I got a demo CD of some new peer-to-peer file sharing software. (“P2P” was all the rage then.) When I ran it, I was astounded to see a copyright message with my name on it. They had used my code as the basis for their commercial product! The code was able to live on in a different form. I’m not sure it was actually legal, given that jnerve was GPL, but I didn’t care enough to pursue the matter.

The Inexact Science of Optimizing MySQL

The past week or so, I’ve been playing database administrator: monitoring load, examining the log for slow queries, tweaking parameters, rewriting queries, adding indexes, and repeating the cycle over and over again. It’s a tedious, time-consuming, and exhausting process.

Tuning is tricky because inefficiencies aren’t always immediately apparent, and in a VPS environment there may be factors affecting performance that you can’t control. Here are some things I’ve learned, or had to relearn, about tuning MySQL.

Don’t assume anything about why queries are slow. Under high load, a lot of queries can pop up in the slow query log just because the database is working hard. It doesn’t necessarily mean every query needs optimization. Try to look for patterns, and expect to spend time reviewing queries that are already optimized.

Look for strange queries. Don’t recognize one? Maybe it shouldn’t be there. I found an expensive query for a website feature that had been obsolete for months; the query was still being made even though the data wasn’t used or displayed anywhere.

EXPLAIN is your friend. Interpreting the output makes my head pound sometimes, but it’s absolutely necessary. Invest the time and effort to understand it.
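For what it’s worth, the columns I found myself reading most were key, rows, and Extra (table and column names below are hypothetical):

```sql
-- key: which index is actually used; rows: roughly how many rows MySQL
-- expects to examine; Extra: flags like "Using temporary" or "Using filesort".
EXPLAIN SELECT id, title
FROM articles
WHERE category_id = 12
ORDER BY created_at DESC
LIMIT 20;
```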

Multi-column indexes are a necessary evil. Most tables I was examining had only single-column indexes, causing complex WHERE clauses and/or ORDER BY to do costly “Using temporary; Using filesort” operations. Adding a few multi-column indexes helped a great deal. In general, I dislike multi-column indexes because they require maintenance: if you modify a complex query, you might have to add or change the index, or it will become slow as molasses. But unfortunately, that’s the tradeoff for performance.
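A made-up example of the kind of change that helped (the articles table and its columns are hypothetical):

```sql
-- The problem query: filter on two columns, sort on a third. With only
-- single-column indexes, EXPLAIN shows a filesort.
SELECT id, title
FROM articles
WHERE category_id = 12 AND published = 1
ORDER BY created_at DESC
LIMIT 20;

-- A composite index that covers the WHERE columns and ends with the
-- ORDER BY column lets MySQL read the rows already in the right order.
ALTER TABLE articles
    ADD INDEX idx_cat_pub_created (category_id, published, created_at);
```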

The “same” query can behave differently with different datasets. A query can sometimes use an index, and sometimes use a filesort, depending, for example, on constants in the WHERE clause. Be sure to use EXPLAIN with different values.
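A contrived illustration, assuming an index on created_at (the date constants are arbitrary):

```sql
-- A selective range is likely to use the index on created_at...
EXPLAIN SELECT id FROM articles WHERE created_at > '2008-03-01';

-- ...but a range matching most of the table may push the optimizer toward
-- a full scan, and any accompanying ORDER BY then becomes a filesort.
EXPLAIN SELECT id FROM articles WHERE created_at > '2000-01-01';
```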

Give the query optimizer hints. Disk access is slow, especially on the VPS, so I wanted to avoid filesorts at all costs. Using STRAIGHT_JOIN forced MySQL to process an indexed table first; otherwise, it believed that a filesort would be faster in some cases. It might be, sometimes, but if you think you know better, use the hint.
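Roughly like this, with hypothetical tables again; STRAIGHT_JOIN tells MySQL to join the tables in the order they appear in the FROM clause:

```sql
-- Read articles first, via the index that satisfies the ORDER BY,
-- instead of letting the optimizer pick a plan that ends in a filesort.
SELECT STRAIGHT_JOIN a.id, a.title, u.name
FROM articles AS a
JOIN users AS u ON u.id = a.author_id
WHERE a.published = 1
ORDER BY a.created_at DESC
LIMIT 20;
```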

Disk access on a VPS can be wonky. It depends a lot on what else is happening on the hardware node. This relates somewhat to the previous point: all other things being equal, MySQL could be right about a filesort being faster than an index. But it’s usually not the case in an environment where disk I/O is very costly.

Reduce the number of joins wherever possible. It doesn’t seem like a few indexed joins into “lookup” tables (basically key-value affairs) would affect performance that much, but amazingly, they can. Really. Break queries up into smaller, simpler ones if at all possible.
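For example, with a couple of hypothetical lookup tables:

```sql
-- Before: every request drags the lookup tables into the main query.
SELECT a.id, a.title, c.name AS category, s.label AS status
FROM articles a
JOIN categories c ON c.id = a.category_id
JOIN statuses   s ON s.id = a.status_id
WHERE a.published = 1;

-- After: one simpler query for the articles, plus two tiny queries whose
-- results are easy to cache and map to the rows in application code.
SELECT a.id, a.title, a.category_id, a.status_id
FROM articles a
WHERE a.published = 1;

SELECT id, name  FROM categories;
SELECT id, label FROM statuses;
```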

Run ANALYZE TABLE and OPTIMIZE TABLE. The manual says you don’t typically need to run these commands, but I found that frequently updated tables benefit from this maintenance, even if they have only a few thousand records.
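The commands themselves are trivial (table name hypothetical): ANALYZE TABLE refreshes the index statistics the optimizer relies on, and OPTIMIZE TABLE rebuilds the table and defragments its data.

```sql
ANALYZE TABLE articles;
OPTIMIZE TABLE articles;
```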

Work iteratively. Sometimes performance problems are localized to one or two spots, but more often, inefficiencies are spread out over a lot of queries; they may become noticeable only after several passes, as server conditions fluctuate. Discovering these problem areas takes a lot of patience.