Engineering

Engineering is about making tradeoffs. You consider all the factors, optimize for the ones that are most important, and make the best choices based on that.

Put in that way, engineering is pretty damn boring. It’s not easy. But being difficult doesn’t necessarily make something interesting.

In my younger days, I put a lot of energy into exploring the latest fads and keeping up with “best practices,” all in the hopes of becoming a better software engineer. I do that a lot less these days–only as much as I need to for work. Anyone who genuinely gets off on “best practices” is a bit suspect in my book.

At some point, I think my interest in coding changed from “is this well written software?” to something like “what kind of thinking does this code imply about how it solves the problem?”

This is way more interesting! Even so-called “bad” code can be very interesting and teach you something about a unique way to approach the problem it solves.

Obviously, we can’t spend all our time appreciating various aspects of less-than-optimal code. There’s work to be done, efficiency to be measured, best practices to follow. It’s not like we got into coding because we enjoyed thinking or anything.

Generator Expressions

I discovered Python generator expressions this week. They’ve been around since v2.4, way back in 2004, and yet, somehow, they managed to escape my notice until now. I don’t think I’ve ever seen them used in any codebase I’ve worked on.

Generator expressions are almost the same as list comprehension syntax, except instead of square brackets, you use parentheses. And instead of returning a list, the comprehension returns a generator object.

This is quite a nice bit of syntactic sugar that helps to keep things more succinct than writing iterators and generator functions. But it’s not as succinct as, say, the chained method syntax you find in Scala or Ruby, which I think is vastly superior in terms of clarity. Several years ago, I wrote up some notes about Ruby streams in a github repo for my co-workers because it was a valuable technique in some data processing scripts we were working on. (Since then, I haven’t done much Ruby.)

Here’s a bit of Python code that duplicates the simple Ruby example in the linked repo above, with the identical output. Again, it’s not nearly as nice to look at as chained method calls, but still better than having to write separate generator functions or classes.

nums = list(range(1,6))

add10 = lambda i: print(f"adding 10 to {i}") or i + 10
filter_even = lambda i: print(f"filtering {i} on evenness") or i % 2 == 0
output = lambda i: print(f"in each: {i}")

# list comprehensions

r1 = [ add10(x) for x in nums ]
r2 = [ x for x in r1 if filter_even(x) ]
r3 = [ output(x) for x in r2 ]

# generator expressions

r1 = ( add10(x) for x in nums )
r2 = ( x for x in r1 if filter_even(x) )
r3 = ( output(x) for x in r2 )
list(r3)

Data as Coping Mechanism

Life under this pandemic has been hard. Oddly, one of the things that’s helped me deal is to play around with the coronavirus data. The numbers in the U.S. are horrifying, of course, but they’ve also been soothing at a technical level, maybe because working with the data is somewhat different from the work I do for my job. It’s also been neat to do hands-on validation of reporting in the media and various claims made about trends. I’ve been showing my results to friends who seem to find them insightful.

The repository is here. There are links in the README to the charts and visualizations.

Some technical reflections on this little hobby project:

I used Python and SQLite. Apache Spark seemed like overkill and I haven’t spent enough time with Spark to be able to troubleshoot intermediate pipeline steps as easily as I can in a plain SQL database. SQLite is fantastic for doing ETL or ELT. I can’t recommend it enough for non-“big data” scenarios. It’s fast (if you use local disk access), has enough SQL features for most ELT/ETL work, and is well-suited for use by a single user. It’s also good if the end goal is to produce data that will ultimately get imported into another system, say, a full-fledged RDBMS data warehouse that serves multiple users.

Currently, with just over 5 months of county- and state-level data, it takes ~2 minutes to to load all the raw data, transform it into dimensional tables that calculate various measures, and create data files used by the web pages that display tables and charts. The SQLite file is 850 MB which includes a lot of stage tables. This is on my laptop with a i5-7300U processor. Not too bad.

I created a Makefile to handle dependencies in the data pipeline, so that it only re-runs parts as needed. It’s currently not as fine-grained as it could be. For example, any change in the JHU CSSE data files will reload ALL the files into the database, but that portion of the code takes only maybe 10s total anyway. Similarly, all the dimensional models are created in a single process and could be split out. I’m happy with how it turned out overall with the qualification that writing and maintaining the Makefile is a bit of a pain. I might try using Apache Airflow instead at some point.

Storing data files in a git repo feels gross. But I did this so the chart and map web pages served through GitHub Pages could load the static data files. It’s a simple and free hosting solution.

In general, I like how simple this setup turned out to be and how easy it’s been to add new measures or tweak existing ones.

EDIT: In November 2020, I switched to using BigQuery.

Initial Thoughts on Data Engineering

Last year, I transitioned to doing data engineering and data warehousing work. It’s been an interesting journey so far—I’m still very much learning—but I thought I would make a post about the insights I’ve had into the nature of this work and the surprises I’ve encountered.

First off, anyone interested in this topic should read Maxime Beauchemin’s article, “The Rise of the Data Engineer”. It’s a great overview of this role’s emergence within the growing field of data science. He writes: “Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization. Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.”

I can’t speak to data science just yet, as that is relatively new to me as well. But I’ve been doing software engineering for a while now, so my reflections here come from that perspective.

Thinking in Sets

When I started, I thought, how hard can this be? Isn’t it just writing SQL? I’ve done that. Piece of cake, right? Yes and no.

Though I’ve worked a lot with transactional databases, writing complex queries for ETL and reporting purposes requires a very different mindset. Especially when you are writing stored procedures composed of statements that join several tables, pivot the result, transform and filter rows using ranking window functions, and then union a bunch of tables together. It takes some time to train your brain to map higher-level operations to the crazy-looking SELECT statements or lengthy groups of common table expressions that perform them. Much of this is due to SQL being such an odd beast compared to the mainstream object-oriented languages of the day (more on that below).

Relational databases are all about sets, so you have to resist your imperative impulses. For example, instead of doing things iteratively in a loop containing IF statements inside it, you can probably express the same thing using a cartesian product and WHERE clauses. It’s usually faster, the code is more compact, and it takes advantage of the set-oriented nature of the language. Joe Celko’s Thinking in Sets is a good book that’s helped me adjust to, well, thinking in sets. I highly recommend it.

Programming Skills

So after these initial experiences, my thinking changed to: “okay, this stuff really is a different animal than software development.” Again, yes and no.

I’ve noticed that functional programming and “big data” trends have had a lot of influence on data engineering. Traditional practices from data warehousing, shaped in large part by Kimball, are still relevant, but they’ve also been changing in response to these new developments. Designing ETL pipelines with immutability in mind, a core concept from FP, makes them much easier to understand and troubleshoot. The irony here is that while declarative languages have been getting a lot of attention in recent years, SQL often goes unrecognized as a member of this category, even as it’s been around forever. (Because let’s face it, SQL just isn’t sexy.)

The data warehouse I work on is a custom-built system of stored procedures in SQL Server. This gives us tremendous control and flexibility that we wouldn’t have with an off-the-shelf warehousing or ETL product. It also means I write a lot of scripts in PowerShell and R to do preprocessing, loads, builds, updates, etc. While those languages are new to me, having a background in coding has helped tremendously with hitting the ground running. Beauchemin ntoes the general move away from GUI-based products towards code, a trend which I wholeheartedly support: “There’s a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it’s that ultimately code is the best abstraction there is for software.”

Along those lines, designing tables and relationships is basically an exercise in abstraction, sharing a lot of similarities with designing data structures and object classes. However, a major annoyance I’ve found with SQL databases is that I often find I can’t achieve the same degree of abstraction as with other languages and technologies. You can’t group together fields from a table and work with that group repeatedly. There’s no table inheritance and table “typing”, in a way that allows you to use only certain types of tables (those implementing a certain set of columns, for example) in a query that you want to restrict in this way. The limits on abstraction also limit code reuse.

I suspect these hindrances are a major reason for the move away from SQL towards other languages to do data processing.

A Priori Guarantees and Empirical Validation

The biggest difference is actually a subtle one that’s taken me a long time to identify and name.

Compiled languages like Java give you a lot of a priori guarantees. This lets you create very modular code and also have confidence that the pieces work together in extremely well-defined ways.

When working with data pipelines, the pieces are much more loosely coupled. I find it more challenging to reason about the potential effects of changes you make at any given stage of a pipeline, especially when data flows out of one system into another. So I find myself doing a lot more validation at each stage to make up for this. I try to get unexpected consequences to raise errors instead of just “failing” silently (where “fail” here means that a JOIN or a WHERE clause that simply stops matching). This means a lot more manual work, unfortunately. I imagine people out there are thinking about how to effectively handle upstream changes in a sane way, but I’m not quite there yet.

Making Tools

In software development, I’ve always felt more comfortable doing back-end rather than front-end work. Data engineering fits into that mindset very well. At almost every job I’ve had, I jumped at the chance to create tools and utilities. This has always been an auxiliary thing, but I’m finding it to be more front and center in data engineering, which is really enjoyable and gratifying to me.

Some people hate writing the “glue” pieces to get systems to work, but I love it. They’re often chances to think about architecture, modularity, interfaces, and optimization. There’s something very satisfying about that sort of thinking. It feels more like computing, in a world where that term has almost completely lost meaning.

On the way to conciliator 3.0

When I started writing conciliator, I focused solely on implementing the main URL endpoint for reconciliation, since that’s what I needed at the time. There are actually several APIs that work together in OpenRefine: the Preview, Suggest, and recently added Data Extension APIs provide functionality that complements Reconciliation. And there may be more in the future as OpenRefine continues to evolve.

My code didn’t extend very easily, so I’ve rewritten a ton of stuff. This is currently in the master branch and it’s running on http://refine.codefork.com. I will release 3.0 probably sometime in January.

Most users don’t need to do anything. Out of the box, it should work as the previous version did. If you modified the conciliator.properties file, you should look at the changes in that file.

Here are the changes under the hood so far:

– Spring controllers and components are now used more effectively for better separation of concerns, less manual plumbing, better extensibility and maintainability.
– The classes representing data in/out for the various APIs are more fully fleshed out.
– The conciliator.properties file allows for less configurability than before, but I don’t know how useful that ever was, really.
– Tests have been rewritten for more “real world” coverage.
– Custom cache code has been replaced with Ehcache.
– Requires Java 8.

With this scaffolding in place, I can start actually implementing Data Extension for specific data sources. I plan to start with OpenLibrary as it provides the richest data.

Stay tuned.

Thanks to everyone who contributed bug reports and suggestions! I’m amazed and gratified that this software gets used as much as it does.