Initial Thoughts on Data Engineering

Last year, I transitioned to doing data engineering and data warehousing work. It’s been an interesting journey so far—I’m still very much learning—but I thought I would make a post about the insights I’ve had into the nature of this work and the surprises I’ve encountered.

First off, anyone interested in this topic should read Maxime Beauchemin’s article, “The Rise of the Data Engineer”. It’s a great overview of this role’s emergence within the growing field of data science. He writes: “Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization. Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.”

I can’t speak to data science just yet, as that is relatively new to me as well. But I’ve been doing software engineering for a while now, so my reflections here come from that perspective.

Thinking in Sets

When I started, I thought, how hard can this be? Isn’t it just writing SQL? I’ve done that. Piece of cake, right? Yes and no.

Though I’ve worked a lot with transactional databases, writing complex queries for ETL and reporting purposes requires a very different mindset. Especially when you are writing stored procedures composed of statements that join several tables, pivot the result, transform and filter rows using ranking window functions, and then union a bunch of tables together. It takes some time to train your brain to map higher-level operations to the crazy-looking SELECT statements or lengthy groups of common table expressions that perform them. Much of this is due to SQL being such an odd beast compared to the mainstream object-oriented languages of the day (more on that below).

Relational databases are all about sets, so you have to resist your imperative impulses. For example, instead of doing things iteratively in a loop containing IF statements inside it, you can probably express the same thing using a cartesian product and WHERE clauses. It’s usually faster, the code is more compact, and it takes advantage of the set-oriented nature of the language. Joe Celko’s Thinking in Sets is a good book that’s helped me adjust to, well, thinking in sets. I highly recommend it.

Programming Skills

So after these initial experiences, my thinking changed to: “okay, this stuff really is a different animal than software development.” Again, yes and no.

I’ve noticed that functional programming and “big data” trends have had a lot of influence on data engineering. Traditional practices from data warehousing, shaped in large part by Kimball, are still relevant, but they’ve also been changing in response to these new developments. Designing ETL pipelines with immutability in mind, a core concept from FP, makes them much easier to understand and troubleshoot. The irony here is that while declarative languages have been getting a lot of attention in recent years, SQL often goes unrecognized as a member of this category, even as it’s been around forever. (Because let’s face it, SQL just isn’t sexy.)

The data warehouse I work on is a custom-built system of stored procedures in SQL Server. This gives us tremendous control and flexibility that we wouldn’t have with an off-the-shelf warehousing or ETL product. It also means I write a lot of scripts in PowerShell and R to do preprocessing, loads, builds, updates, etc. While those languages are new to me, having a background in coding has helped tremendously with hitting the ground running. Beauchemin ntoes the general move away from GUI-based products towards code, a trend which I wholeheartedly support: “There’s a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it’s that ultimately code is the best abstraction there is for software.”

Along those lines, designing tables and relationships is basically an exercise in abstraction, sharing a lot of similarities with designing data structures and object classes. However, a major annoyance I’ve found with SQL databases is that I often find I can’t achieve the same degree of abstraction as with other languages and technologies. You can’t group together fields from a table and work with that group repeatedly. There’s no table inheritance and table “typing”, in a way that allows you to use only certain types of tables (those implementing a certain set of columns, for example) in a query that you want to restrict in this way. The limits on abstraction also limit code reuse.

I suspect these hindrances are a major reason for the move away from SQL towards other languages to do data processing.

A Priori Guarantees and Empirical Validation

The biggest difference is actually a subtle one that’s taken me a long time to identify and name.

Compiled languages like Java give you a lot of a priori guarantees. This lets you create very modular code and also have confidence that the pieces work together in extremely well-defined ways.

When working with data pipelines, the pieces are much more loosely coupled. I find it more challenging to reason about the potential effects of changes you make at any given stage of a pipeline, especially when data flows out of one system into another. So I find myself doing a lot more validation at each stage to make up for this. I try to get unexpected consequences to raise errors instead of just “failing” silently (where “fail” here means that a JOIN or a WHERE clause that simply stops matching). This means a lot more manual work, unfortunately. I imagine people out there are thinking about how to effectively handle upstream changes in a sane way, but I’m not quite there yet.

Making Tools

In software development, I’ve always felt more comfortable doing back-end rather than front-end work. Data engineering fits into that mindset very well. At almost every job I’ve had, I jumped at the chance to create tools and utilities. This has always been an auxiliary thing, but I’m finding it to be more front and center in data engineering, which is really enjoyable and gratifying to me.

Some people hate writing the “glue” pieces to get systems to work, but I love it. They’re often chances to think about architecture, modularity, interfaces, and optimization. There’s something very satisfying about that sort of thinking. It feels more like computing, in a world where that term has almost completely lost meaning.

What’s Worse?

What’s worse: a website that is intermittently down or completely down?

The latter is worse, right? Isn’t it better that a site serve, say, 80% of requests, than 0%? This is the cloud-think we’ve all become accustomed to.

Here’s the thing: when a web application or service is intermittently down, it can hide the fact that there are any problems at all. It’s easy to dismiss problems as due to factors beyond your control, or momentary blips that will clear up on their own. In the meantime, what happens is that a user going through a sequence of, say, 6 requests to complete a workflow, will experience failure on that 6th request serviced by the one bad host or container in the cloud. And they will get frustrated and give up. And they’ll start to associate your application with being flakey and unreliable.

And you won’t notice, because it’s not happening to everyone, and the problem persists for a long while before it’s detected and fixed.

This is how the “high availability” mentality of the cloud lures you into a false sense of security.

I’ve been seeing this happen with Docker Swarm, where, under certain conditions, some newly started containers will have intermittent connectivity problems with other containers. Unless you’re paying close attention to error logs, you may not notice any problems, even though some users are definitely experiencing them.

But when a site is completely down, everyone knows, and you can’t help but address the problem.

Okay, sure, the answer of which is worse depends a lot on the type of website or web application. My point is simply that there’s often the presumption that putting things in the cloud alleviates the pressure upon individual instances of an application or service to be up and functioning correctly. This just isn’t true. And at the point where you need to care about and closely monitor individual containers because you take availability seriously, well, at that point, the cloud maybe hasn’t bought you as much as you thought it would.

Lumen: A port of Blacklight to Scala and the Play framework

I’ve done some work the past two years using Blacklight, a great discovery interface for Solr with a lot of library catalog features. It’s quality software with years of work invested in it by some very smart people.

Over time, “the Ruby way” of doing things, as well as “the Rails way,” has bugged me more and more. Things like the use of naming conventions for hooks, passing arrays and hashes around in lieu of actual data structures, the varying use of hashes and OpenStructs, the ability to monkey patch, the difficulty of looking at a method and not being able to tell what its arguments are or can be, the need to go digging around in the source code of a gem to figure out how certain APIs are dynamically created because those methods can’t get automatically documented by tools like rubydoc or yard. These things often make life easier when you’re writing new code and trying to do it quickly, but they create nightmares when you try to refactor stuff or upgrade gem dependencies.

The last few months, I’ve been slowly porting Blacklight to Scala and the Play framework. I’m calling this new project Lumen.

Scala is a powerful statically typed, compiled language that permits you to mix object-oriented and functional paradigms, and it allows you to take advantage of the enormous ecosystem of existing Java libraries. It also incorporates some of the innovations of the last two decades of dynamic languages that make programmers happy. I think it’s a language that privileges software quality over rapid development.

So far, Lumen has been a hobby project to learn Scala, so I’ve approached it in a less disciplined fashion than I otherwise might. This means there are lots of TODOs scattered throughout, and style/design inconsistencies as I’ve learned better ways to do things but haven’t always gone back to change things everywhere. That’s life when you’re noodling around in your spare time. Lumen is largely an experiment right now, but I hope it will eventually grow into a full-featured, production-quality piece of software. We’ll see.

You can check out a demo here: http://lumen-demo.codefork.com.

Announcing conciliator

I’ve just created a github repository for conciliator, a growing collection of OpenRefine reconciliation services, as well as a Java framework for creating them.

conciliator is a major refactoring of my refine_viaf project and supercedes it. This new project cleanly separates the VIAF-specific parts and the more “boilerplate” pieces needed for any OpenRefine reconciliation service. The result is a framework that allows you to easily write new reconciliation services. My intent here is to make some existing code way more flexible, so that it might be useful to more users and have a longer lifespan.

http://refine.codefork.com has already been running conciliator for a week now; if you’ve been using it, you don’t need to make any changes in OpenRefine.

Currently, conciliator out-of-the-box can query VIAF exactly like refine_viaf does, down to the same URLs. Additionally, conciliator can now query ORCID names. This was a somewhat arbitrary choice; I’ve been doing some ORCID integration at work so it was convenient for me to implement a data source for it as a proof of concept.

With VIAF and ORCID, conciliator acts as an intermediate or “bridge” service, but it would be possible to use conciliator to query other types of data sources as well: files, SQL databases, etc. Right now, you’d have to write your own code to read and parse files, open database connections, etc. But in the future, I hope to add support for these options to make them easier to implement.

For details on how to write your own service in Java using conciliator, see the README.

Are there data sources you’d like to see available as a reconciliation service? Leave a comment to this post. No promises, but I’ll at least consider all requests. And if you write your own service for a data source, please consider submitting your code as a pull request so that others can use it too!

The Myth of Artisanal Programming

Paul Chiusano, the author of the excellent Functional Programming in Scala from Manning (one of the few tech publishers I buy from; worth every penny), recently wrote a blog post titled, “The advantages of static typing, simply stated”.

Lately all I seem to do is rant to people about this exact topic. Paul’s post is way more succinct than anything I can write, so go over there and read it.

While he takes pains to give a balanced treatment of static vs dynamic type systems, it seems much more cut and dry to me. Dynamic languages are easier and faster for development when you’re getting started on a project, and it’s great if that project never gets very big. But they scale very poorly, for all the reasons he describes. Recently, I had the daunting task of reading almost ~10k lines of Perl code (pretty good Perl, in my opinion). It was hard to make sense of and figure out how to modify and extend, whereas the MUCH larger Java codebase (over 100k lines, if I recall) that I worked with years ago felt very manageable.

My own history as a programmer matches Paul’s very closely. I started with Java, which was annoying but not a bad language by any means. Then Python came along and seemed like a liberation from Java’s rigidity and verbosity. But Python, Ruby and others are showing their weaknesses, and it’s no mystery why people are turning to the newer generation of statically typed languages like Scala, Haskell, Go, etc.

People who haven’t been around as long don’t necessarily have this perspective.

In retrospect, it’s interesting to me how we programmers “got sold” on dynamic languages, from a cultural perspective. You might recall that a big selling point was using simple text editors rather than IDEs, and there was this sense that writing code this way made you closer to the software somehow. Java was corporate, while Python was hand-crafted. There was a vague implicit notion of “artisanal” programming in these circles.

The upshot, of course, is that every time you read a chunk of code or call a function or method, your brain has to do a lot of the work that a statically typed language would be able to enforce and verify for you. But in a dynamic language, you won’t know what happens until the code runs. In large measure, the quality of software hinges on how much you can tell, a priori, about code before it runs at all. In a dynamic world, anything can happen, and often does.

This is a nightmare, pure and simple. Much of the strong focus on writing automated tests is to basically make up for the lack of static typing.

True artisanship lies in design: namely, thinking hard about the data structures and code organization you’re committing to. It’s not about being able to take liberties that can result in things that make no sense to the machine and that can cause errors at runtime that could have been caught beforehand.