What’s Worse?

What’s worse: a website that is intermittently down or completely down?

The latter is worse, right? Isn’t it better that a site serve, say, 80% of requests, than 0%? This is the cloud-think we’ve all become accustomed to.

Here’s the thing: when a web application or service is intermittently down, it can hide the fact that there are any problems at all. It’s easy to dismiss problems as due to factors beyond your control, or momentary blips that will clear up on their own. In the meantime, what happens is that a user going through a sequence of, say, 6 requests to complete a workflow, will experience failure on that 6th request serviced by the one bad host or container in the cloud. And they will get frustrated and give up. And they’ll start to associate your application with being flakey and unreliable.

And you won’t notice, because it’s not happening to everyone, and the problem persists for a long while before it’s detected and fixed.

This is how the “high availability” mentality of the cloud lures you into a false sense of security.

I’ve been seeing this happen with Docker Swarm, where, under certain conditions, some newly started containers will have intermittent connectivity problems with other containers. Unless you’re paying close attention to error logs, you may not notice any problems, even though some users are definitely experiencing them.

But when a site is completely down, everyone knows, and you can’t help but address the problem.

Okay, sure, the answer of which is worse depends a lot on the type of website or web application. My point is simply that there’s often the presumption that putting things in the cloud alleviates the pressure upon individual instances of an application or service to be up and functioning correctly. This just isn’t true. And at the point where you need to care about and closely monitor individual containers because you take availability seriously, well, at that point, the cloud maybe hasn’t bought you as much as you thought it would.