Data as Coping Mechanism

Life under this pandemic has been hard. Oddly, one of the things that’s helped me deal is to play around with the coronavirus data. The numbers in the U.S. are horrifying, of course, but they’ve also been soothing at a technical level, maybe because working with the data is somewhat different from the work I do for my job. It’s also been neat to do hands-on validation of reporting in the media and various claims made about trends. I’ve been showing my results to friends who seem to find them insightful.

The repository is here. There are links in the README to the charts and visualizations.

Some technical reflections on this little hobby project:

I used Python and SQLite. Apache Spark seemed like overkill and I haven’t spent enough time with Spark to be able to troubleshoot intermediate pipeline steps as easily as I can in a plain SQL database. SQLite is fantastic for doing ETL or ELT. I can’t recommend it enough for non-“big data” scenarios. It’s fast (if you use local disk access), has enough SQL features for most ELT/ETL work, and is well-suited for use by a single user. It’s also good if the end goal is to produce data that will ultimately get imported into another system, say, a full-fledged RDBMS data warehouse that serves multiple users.

Currently, with just over 5 months of county- and state-level data, it takes ~2 minutes to to load all the raw data, transform it into dimensional tables that calculate various measures, and create data files used by the web pages that display tables and charts. The SQLite file is 850 MB which includes a lot of stage tables. This is on my laptop with a i5-7300U processor. Not too bad.

I created a Makefile to handle dependencies in the data pipeline, so that it only re-runs parts as needed. It’s currently not as fine-grained as it could be. For example, any change in the JHU CSSE data files will reload ALL the files into the database, but that portion of the code takes only maybe 10s total anyway. Similarly, all the dimensional models are created in a single process and could be split out. I’m happy with how it turned out overall with the qualification that writing and maintaining the Makefile is a bit of a pain. I might try using Apache Airflow instead at some point.

Storing data files in a git repo feels gross. But I did this so the chart and map web pages served through GitHub Pages could load the static data files. It’s a simple and free hosting solution.

In general, I like how simple this setup turned out to be and how easy it’s been to add new measures or tweak existing ones.

Initial Thoughts on Data Engineering

Last year, I transitioned to doing data engineering and data warehousing work. It’s been an interesting journey so far—I’m still very much learning—but I thought I would make a post about the insights I’ve had into the nature of this work and the surprises I’ve encountered.

First off, anyone interested in this topic should read Maxime Beauchemin’s article, “The Rise of the Data Engineer”. It’s a great overview of this role’s emergence within the growing field of data science. He writes: “Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization. Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.”

I can’t speak to data science just yet, as that is relatively new to me as well. But I’ve been doing software engineering for a while now, so my reflections here come from that perspective.

Thinking in Sets

When I started, I thought, how hard can this be? Isn’t it just writing SQL? I’ve done that. Piece of cake, right? Yes and no.

Though I’ve worked a lot with transactional databases, writing complex queries for ETL and reporting purposes requires a very different mindset. Especially when you are writing stored procedures composed of statements that join several tables, pivot the result, transform and filter rows using ranking window functions, and then union a bunch of tables together. It takes some time to train your brain to map higher-level operations to the crazy-looking SELECT statements or lengthy groups of common table expressions that perform them. Much of this is due to SQL being such an odd beast compared to the mainstream object-oriented languages of the day (more on that below).

Relational databases are all about sets, so you have to resist your imperative impulses. For example, instead of doing things iteratively in a loop containing IF statements inside it, you can probably express the same thing using a cartesian product and WHERE clauses. It’s usually faster, the code is more compact, and it takes advantage of the set-oriented nature of the language. Joe Celko’s Thinking in Sets is a good book that’s helped me adjust to, well, thinking in sets. I highly recommend it.

Programming Skills

So after these initial experiences, my thinking changed to: “okay, this stuff really is a different animal than software development.” Again, yes and no.

I’ve noticed that functional programming and “big data” trends have had a lot of influence on data engineering. Traditional practices from data warehousing, shaped in large part by Kimball, are still relevant, but they’ve also been changing in response to these new developments. Designing ETL pipelines with immutability in mind, a core concept from FP, makes them much easier to understand and troubleshoot. The irony here is that while declarative languages have been getting a lot of attention in recent years, SQL often goes unrecognized as a member of this category, even as it’s been around forever. (Because let’s face it, SQL just isn’t sexy.)

The data warehouse I work on is a custom-built system of stored procedures in SQL Server. This gives us tremendous control and flexibility that we wouldn’t have with an off-the-shelf warehousing or ETL product. It also means I write a lot of scripts in PowerShell and R to do preprocessing, loads, builds, updates, etc. While those languages are new to me, having a background in coding has helped tremendously with hitting the ground running. Beauchemin ntoes the general move away from GUI-based products towards code, a trend which I wholeheartedly support: “There’s a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it’s that ultimately code is the best abstraction there is for software.”

Along those lines, designing tables and relationships is basically an exercise in abstraction, sharing a lot of similarities with designing data structures and object classes. However, a major annoyance I’ve found with SQL databases is that I often find I can’t achieve the same degree of abstraction as with other languages and technologies. You can’t group together fields from a table and work with that group repeatedly. There’s no table inheritance and table “typing”, in a way that allows you to use only certain types of tables (those implementing a certain set of columns, for example) in a query that you want to restrict in this way. The limits on abstraction also limit code reuse.

I suspect these hindrances are a major reason for the move away from SQL towards other languages to do data processing.

A Priori Guarantees and Empirical Validation

The biggest difference is actually a subtle one that’s taken me a long time to identify and name.

Compiled languages like Java give you a lot of a priori guarantees. This lets you create very modular code and also have confidence that the pieces work together in extremely well-defined ways.

When working with data pipelines, the pieces are much more loosely coupled. I find it more challenging to reason about the potential effects of changes you make at any given stage of a pipeline, especially when data flows out of one system into another. So I find myself doing a lot more validation at each stage to make up for this. I try to get unexpected consequences to raise errors instead of just “failing” silently (where “fail” here means that a JOIN or a WHERE clause that simply stops matching). This means a lot more manual work, unfortunately. I imagine people out there are thinking about how to effectively handle upstream changes in a sane way, but I’m not quite there yet.

Making Tools

In software development, I’ve always felt more comfortable doing back-end rather than front-end work. Data engineering fits into that mindset very well. At almost every job I’ve had, I jumped at the chance to create tools and utilities. This has always been an auxiliary thing, but I’m finding it to be more front and center in data engineering, which is really enjoyable and gratifying to me.

Some people hate writing the “glue” pieces to get systems to work, but I love it. They’re often chances to think about architecture, modularity, interfaces, and optimization. There’s something very satisfying about that sort of thinking. It feels more like computing, in a world where that term has almost completely lost meaning.

On the way to conciliator 3.0

When I started writing conciliator, I focused solely on implementing the main URL endpoint for reconciliation, since that’s what I needed at the time. There are actually several APIs that work together in OpenRefine: the Preview, Suggest, and recently added Data Extension APIs provide functionality that complements Reconciliation. And there may be more in the future as OpenRefine continues to evolve.

My code didn’t extend very easily, so I’ve rewritten a ton of stuff. This is currently in the master branch and it’s running on http://refine.codefork.com. I will release 3.0 probably sometime in January.

Most users don’t need to do anything. Out of the box, it should work as the previous version did. If you modified the conciliator.properties file, you should look at the changes in that file.

Here are the changes under the hood so far:

– Spring controllers and components are now used more effectively for better separation of concerns, less manual plumbing, better extensibility and maintainability.
– The classes representing data in/out for the various APIs are more fully fleshed out.
– The conciliator.properties file allows for less configurability than before, but I don’t know how useful that ever was, really.
– Tests have been rewritten for more “real world” coverage.
– Custom cache code has been replaced with Ehcache.
– Requires Java 8.

With this scaffolding in place, I can start actually implementing Data Extension for specific data sources. I plan to start with OpenLibrary as it provides the richest data.

Stay tuned.

Thanks to everyone who contributed bug reports and suggestions! I’m amazed and gratified that this software gets used as much as it does.

What’s Worse?

What’s worse: a website that is intermittently down or completely down?

The latter is worse, right? Isn’t it better that a site serve, say, 80% of requests, than 0%? This is the cloud-think we’ve all become accustomed to.

Here’s the thing: when a web application or service is intermittently down, it can hide the fact that there are any problems at all. It’s easy to dismiss problems as due to factors beyond your control, or momentary blips that will clear up on their own. In the meantime, what happens is that a user going through a sequence of, say, 6 requests to complete a workflow, will experience failure on that 6th request serviced by the one bad host or container in the cloud. And they will get frustrated and give up. And they’ll start to associate your application with being flakey and unreliable.

And you won’t notice, because it’s not happening to everyone, and the problem persists for a long while before it’s detected and fixed.

This is how the “high availability” mentality of the cloud lures you into a false sense of security.

I’ve been seeing this happen with Docker Swarm, where, under certain conditions, some newly started containers will have intermittent connectivity problems with other containers. Unless you’re paying close attention to error logs, you may not notice any problems, even though some users are definitely experiencing them.

But when a site is completely down, everyone knows, and you can’t help but address the problem.

Okay, sure, the answer of which is worse depends a lot on the type of website or web application. My point is simply that there’s often the presumption that putting things in the cloud alleviates the pressure upon individual instances of an application or service to be up and functioning correctly. This just isn’t true. And at the point where you need to care about and closely monitor individual containers because you take availability seriously, well, at that point, the cloud maybe hasn’t bought you as much as you thought it would.

True Empowerment

I fixed a bug in the blacklight-marc gem recently. It involved this line of Ruby code:

vals << (v == 'AD') ? 'Atlas' : 'Map'

Contrary to what it looks like, this line adds a boolean value to the vals array. The << operation returns true, so the entire line of code always evaluates to ‘Atlas’. Then nothing happens with that string.

Obviously, this isn’t what was intended. The problem is that << has higher precedence than the if-else operators. So here’s the fix:

vals << (v == 'AD' ? 'Atlas' : 'Map')

This code path wasn’t being taken all the time, and it also didn’t raise any exceptions: the calling code uses the result as an array of strings, so the booleans get automatically converted to “true” and “false” strings. I just happened to notice those weird values where they didn’t make sense, and thought to dig into it.

Let’s be honest: this is the kind of mistake anyone could easily make. I’m 100% certain I’ve done something similar. In fact, I innocently asked some co-workers what the original line of code did, and of course, they interpreted it incorrectly. It’s a tricky little bug.

I thought to post about this because it’s a perfect example of how, in a loosely typed, dynamic language like Ruby, you’re really on your own.

Dynamic languages can often feel “empowering” because they place trust in the programmer. It’s your responsibility not to write code that does anything really crazy or stupid. But there are a lot of these “gotcha” cases, where you’re writing code that’s quite reasonable, and you simply made a mistake that the language lets you get away with, because it’s interpeted differently from what you intended. It’s valid code. And you won’t figure it out until much later, when it shows up as a symptom elsewhere.

By contrast, with Java or Scala, you wouldn’t be able to do this. The compiler would check the types, and meaningfully, say, “Sorry buddy, it doesn’t make sense to me to add a boolean to a List of Strings,” and you’d immediately notice the problem with operator precedence. And you’d fix it.

Your program would never even be able to run with that error in it. Which is some awfully nice work that the language is doing for you there. That feels like true empowerment to me.

Final note: you could argue that good test coverage would catch this. That’s true, but we all know the difficulties of achieving thorough test coverage under deadlines. And this example is particularly annoying to get thorough coverage for, because the line of code is one case of many different cases of values for the variable ‘v’.