Initial Thoughts on Data Engineering

Last year, I transitioned to doing data engineering and data warehousing work. It’s been an interesting journey so far—I’m still very much learning—but I thought I would make a post about the insights I’ve had into the nature of this work and the surprises I’ve encountered.

First off, anyone interested in this topic should read Maxime Beauchemin’s article, “The Rise of the Data Engineer”. It’s a great overview of this role’s emergence within the growing field of data science. He writes: “Like data scientists, data engineers write code. They’re highly analytical, and are interested in data visualization. Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science.”

I can’t speak to data science just yet, as that is relatively new to me as well. But I’ve been doing software engineering for a while now, so my reflections here come from that perspective.

Thinking in Sets

When I started, I thought, how hard can this be? Isn’t it just writing SQL? I’ve done that. Piece of cake, right? Yes and no.

Though I’ve worked a lot with transactional databases, writing complex queries for ETL and reporting purposes requires a very different mindset. Especially when you are writing stored procedures composed of statements that join several tables, pivot the result, transform and filter rows using ranking window functions, and then union a bunch of tables together. It takes some time to train your brain to map higher-level operations to the crazy-looking SELECT statements or lengthy groups of common table expressions that perform them. Much of this is due to SQL being such an odd beast compared to the mainstream object-oriented languages of the day (more on that below).
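To give a flavor of what that looks like, here's a minimal sketch of one such higher-level operation, "keep the latest row per customer," expressed with a ranking window function inside a common table expression. The table and column names are invented for illustration.

```sql
-- Hypothetical example: keep only the most recent address per customer.
WITH ranked_addresses AS (
    SELECT
        customer_id,
        street,
        city,
        updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id   -- restart numbering for each customer
            ORDER BY updated_at DESC   -- newest row gets rn = 1
        ) AS rn
    FROM dbo.customer_address
)
SELECT customer_id, street, city, updated_at
FROM ranked_addresses
WHERE rn = 1;   -- "latest row per customer," expressed declaratively
```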

Relational databases are all about sets, so you have to resist your imperative impulses. For example, instead of doing things iteratively in a loop with IF statements inside it, you can probably express the same thing using a cartesian product and WHERE clauses. It’s usually faster, the code is more compact, and it takes advantage of the set-oriented nature of the language. Joe Celko’s Thinking in Sets is a good book that’s helped me adjust to, well, thinking in sets. I highly recommend it.
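As a hypothetical illustration of that set-oriented style, the following sketch replaces a loop-with-IF-statements approach with a cartesian product trimmed down by a WHERE clause (again, the table names are made up):

```sql
-- Hypothetical example: instead of looping over products and dates and testing
-- each combination with IF statements, build the cartesian product and filter it.
SELECT
    p.product_id,
    d.calendar_date
FROM dbo.product AS p
CROSS JOIN dbo.calendar AS d              -- every product paired with every date
WHERE p.is_active = 1
  AND d.calendar_date >= p.launch_date;   -- keep only dates on or after launch
```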

Programming Skills

So after these initial experiences, my thinking changed to: “okay, this stuff really is a different animal than software development.” Again, yes and no.

I’ve noticed that functional programming and “big data” trends have had a lot of influence on data engineering. Traditional practices from data warehousing, shaped in large part by Kimball, are still relevant, but they’ve also been changing in response to these new developments. Designing ETL pipelines with immutability in mind, a core concept from FP, makes them much easier to understand and troubleshoot. The irony here is that while declarative languages have been getting a lot of attention in recent years, SQL often goes unrecognized as a member of this category, even as it’s been around forever. (Because let’s face it, SQL just isn’t sexy.)
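Here's a rough sketch of what designing for immutability can look like in SQL, using invented table names: an append-only load that writes a new batch each run rather than updating rows in place.

```sql
-- Hypothetical sketch of an append-only load: each run inserts a new batch instead
-- of updating rows in place, so earlier states of the table stay intact for
-- troubleshooting and reruns.
DECLARE @batch_id int =
    (SELECT ISNULL(MAX(batch_id), 0) + 1 FROM dbo.daily_sales_snapshot);

INSERT INTO dbo.daily_sales_snapshot (batch_id, load_date, store_id, total_sales)
SELECT
    @batch_id,
    CAST(GETDATE() AS date),
    store_id,
    SUM(sale_amount)
FROM staging.sales
GROUP BY store_id;
```

Auditing or replaying a run then amounts to filtering on batch_id, rather than trying to reconstruct whatever an in-place UPDATE overwrote.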

The data warehouse I work on is a custom-built system of stored procedures in SQL Server. This gives us tremendous control and flexibility that we wouldn’t have with an off-the-shelf warehousing or ETL product. It also means I write a lot of scripts in PowerShell and R to do preprocessing, loads, builds, updates, etc. While those languages are new to me, having a background in coding has helped tremendously with hitting the ground running. Beauchemin notes the general move away from GUI-based products towards code, a trend which I wholeheartedly support: “There’s a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it’s that ultimately code is the best abstraction there is for software.”

Along those lines, designing tables and relationships is basically an exercise in abstraction, sharing a lot of similarities with designing data structures and object classes. However, a major annoyance with SQL databases is that I often can’t achieve the same degree of abstraction as with other languages and technologies. You can’t group together fields from a table and work with that group repeatedly. There’s no table inheritance or table “typing” that would let you restrict a query to only certain kinds of tables (those implementing a certain set of columns, for example). These limits on abstraction also limit code reuse.
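To make that concrete, here's a contrived illustration of the repetition this forces, with invented tables and columns:

```sql
-- Hypothetical illustration: the same "address" column group has to be spelled out
-- in every query that touches it; there's no way to name the group once and reuse it.
SELECT c.customer_id, c.street, c.city, c.postal_code
FROM dbo.customer AS c;

SELECT s.supplier_id, s.street, s.city, s.postal_code
FROM dbo.supplier AS s;
```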

I suspect these hindrances are a major reason for the move away from SQL towards other languages to do data processing.

A Priori Guarantees and Empirical Validation

The biggest difference is actually a subtle one that’s taken me a long time to identify and name.

Compiled languages like Java give you a lot of a priori guarantees. This lets you create very modular code and also have confidence that the pieces work together in extremely well-defined ways.

When working with data pipelines, the pieces are much more loosely coupled. I find it more challenging to reason about the potential effects of changes made at any given stage of a pipeline, especially when data flows out of one system into another. So I find myself doing a lot more validation at each stage to make up for this. I try to get unexpected consequences to raise errors instead of just “failing” silently (where “fail” here means that a JOIN or a WHERE clause simply stops matching). This means a lot more manual work, unfortunately. I imagine people out there are thinking about how to handle upstream changes in a sane way, but I’m not quite there yet.
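Here's a sketch of the kind of validation I mean, using hypothetical warehouse tables: an explicit check that raises an error when fact rows stop matching their dimension, rather than letting a later JOIN quietly drop them.

```sql
-- Hypothetical validation step after a load: fail loudly if any fact rows no longer
-- match a dimension row, instead of letting a downstream JOIN silently drop them.
DECLARE @orphan_count int;

SELECT @orphan_count = COUNT(*)
FROM dbo.fact_sales AS f
LEFT JOIN dbo.dim_product AS p
    ON f.product_key = p.product_key
WHERE p.product_key IS NULL;

IF @orphan_count > 0
BEGIN
    THROW 50001, 'fact_sales contains rows with no matching dim_product row.', 1;
END;
```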

Making Tools

In software development, I’ve always felt more comfortable doing back-end rather than front-end work. Data engineering fits into that mindset very well. At almost every job I’ve had, I jumped at the chance to create tools and utilities. This has always been an auxiliary thing, but I’m finding it to be more front and center in data engineering, which is really enjoyable and gratifying to me.

Some people hate writing the “glue” pieces to get systems to work, but I love it. They’re often chances to think about architecture, modularity, interfaces, and optimization. There’s something very satisfying about that sort of thinking. It feels more like computing, in a world where that term has almost completely lost meaning.

What’s Worse?

What’s worse: a website that is intermittently down or completely down?

The latter is worse, right? Isn’t it better that a site serve, say, 80% of requests, than 0%? This is the cloud-think we’ve all become accustomed to.

Here’s the thing: when a web application or service is intermittently down, it can hide the fact that there are any problems at all. It’s easy to dismiss problems as due to factors beyond your control, or as momentary blips that will clear up on their own. In the meantime, a user going through a sequence of, say, 6 requests to complete a workflow will experience a failure when the 6th request is serviced by the one bad host or container in the cloud. And they will get frustrated and give up. And they’ll start to associate your application with being flaky and unreliable.

And you won’t notice, because it’s not happening to everyone, and the problem persists for a long while before it’s detected and fixed.

This is how the “high availability” mentality of the cloud lures you into a false sense of security.

I’ve been seeing this happen with Docker Swarm, where, under certain conditions, some newly started containers will have intermittent connectivity problems with other containers. Unless you’re paying close attention to error logs, you may not notice any problems, even though some users are definitely experiencing them.

But when a site is completely down, everyone knows, and you can’t help but address the problem.

Okay, sure, the answer to which is worse depends a lot on the type of website or web application. My point is simply that there’s often a presumption that putting things in the cloud alleviates the pressure on individual instances of an application or service to be up and functioning correctly. This just isn’t true. And at the point where you need to care about and closely monitor individual containers because you take availability seriously, well, at that point, the cloud maybe hasn’t bought you as much as you thought it would.

Annoyances in Xubuntu 16.04 LTS

This week, I installed Xubuntu on a new work computer. I’d previously sworn off Ubuntu, but I admit, I’m crawling back now… the reality is that Ubuntu has smoothed out many of the rough edges that I’m simply not willing to deal with at work. Sigh.

Even as generally polished as Xubuntu is, I did encounter a few hiccups.

1) To adjust settings for the screen locking software, light-locker, I needed to make sure the light-locker-settings package was installed. Nothing happened when I selected “Light Locker Settings” from the Whisker Menu, though, because it was crashing. I ran “light-locker-settings” in a terminal and saw some Python error messages.

Python was trying to import a module from python-gobject, which wasn’t installed and wasn’t a prerequisite for light-locker-settings for some reason.

After that error went away, I got another one about a missing function. To fix it, you have to manually patch two lines in a Python file, as described in this bug report. [NOTE: This has been fixed as of 7/20/2016, in version 1.5.0-0ubuntu1.1 of light-locker-settings]

2) Another light-locker quirk: the mouse pointer becomes invisible when I lock the screen by hitting Ctrl-Alt-Del and then unlock it. To make it visible again, hit Ctrl-Alt-F1 to switch to a text console and then Ctrl-Alt-F7 to return to Xfce.

3) The “Greybird” theme is notorious for making it VERY difficult to resize windows by dragging the handles that appear when you mouse-over the window edges and bottom corners. The pointer has to be EXACTLY on an edge or corner; it won’t display the resize handle if you’re slightly off.

For reasons I don’t understand, the devs seem intent on not changing this. But enough users have complained that the Xubuntu blog even has a post about alternative ways to resize windows. The disregard for user experience here is simply mind-blowing.

I’ve grudgingly started using the Alt and right-click drag combo to resize windows.

Addendum:

4) Intermittent DNS problems: hostnames on our internal domain weren’t always resolving. This seems like a common problem on Ubuntu caused by dnsmasq. The solution is to disable it by commenting out the line “dns=dnsmasq” in /etc/NetworkManager/NetworkManager.conf and rebooting.

Taking A Fresh Look at PHP

I’ve recently started working on a PHP/Laravel project.

PHP isn’t new to me. Many years ago, I wrote a very simple online catalog and shopping cart, from scratch, for a friend who had his own business as a rare book dealer. He used it with much success for several years. I’d also done a bit of hacking on some Drupal plugins.

Coming back to PHP now, I’m finding myself in a world MUCH different than the one I’d left.

First off, let’s admit that PHP comes with a lot of baggage. For a long time, “real” programmers shunned PHP because it was born as a language cobbled together to do simple web development but not much more. Its ease of use, combined with the fact that it was easy to deploy on commodity web hosting, meant you could find PHP talent for relatively cheap to build your applications. The stereotype was that PHP developers relied on a lot of patchy copy-and-paste solutions to build shoddy and insecure websites.

A LOT has happened since then. Here’s what I’ve encountered so far, diving back into PHP:

Object-orientation: PHP has had objects for a long time, but more recent features like namespaces, traits, and class autoloading have made newer PHP projects very strongly object-oriented. You can even find books on design patterns for PHP.

To me, this is the single most important positive change to the PHP world. The culture has changed from an ad hoc procedural mindset to more sophisticated thinking about coding for large-scale architectures.

Frameworks: Several major MVC frameworks exist, many of them drawing inspiration from Rails.

Performance: As of 5.5, PHP has a built-in opcode cache, making it much more performant. An alternative to core PHP is the HHVM project, backed by Facebook, which is a high-performance PHP implementation. HHVM has had a “rising tide” effect: the forthcoming PHP7 is supposed to be as fast as HHVM. So whatever you use, you can expect good performance at scale.

Tooling: There is sophisticated tooling like composer and a vibrant ecosystem of packages. While you can still deploy PHP applications the old way, using Apache and mod_php, there is a mature FastCGI Process Manager (PHP-FPM) engine that isolates PHP processes from the web server. PHP-FPM allows Apache/nginx/whatever web server to handle static content while a pool of processes handles PHP requests. This results in much more efficient memory usage and better performance.

Success: Many respectable, high-profile products have been built using PHP: WordPress, Drupal, and Facebook, just to name a few.

But all this is just to state a bunch of known facts. To me, the biggest surprise has been in the EXPERIENCE of beginning to write code again in PHP and using Laravel: what does that FEEL like?

In a word, it feels like Java, minus the strong typing. This is an entirely good thing in my opinion, despite criticisms that PHP technologies have become too complex and overdesigned.

The biggest paradigm difference between PHP and other popular web application back-ends is that nothing remains loaded in memory between requests. It’s true that opcode caching means PHP doesn’t have to re-compile PHP source code files to opcodes every time, which speeds things up greatly, but the opcodes still need to be executed for each request, from bootstrapping the application to finishing the HTTP response. In practice, this doesn’t actually matter, but it’s such a significant under-the-hood difference from, say, Django and Rails, that I find myself thinking about it from time to time.

It’s reassuring that when I scour the interwebs researching something PHP-related, I’m finding a lot of informed technical discussions and smart people who have come to PHP from other languages and backgrounds. It bodes well for the strengths and the future of the technology.

On Magic

Kids, I hate to break it to you, but there is no such thing as magic.

The cool whizzy stuff on your screen that impresses you: that’s the result of work. The button that was broken yesterday, that now works correctly today: also the product of work. The screen that was discussed in a meeting last week that suddenly appeared today on the development server: yup, work. When you look for a feature in the web application and it isn’t there, there’s this thing that can create it and put it there: it’s called work.

Someday we’ll all get over the mystifying aura of technology. Someday people will learn to recognize that programmers are not magicians, just workers, and that the work they do involves mundane, non-magical tasks, like wrestling with code libraries and frameworks to get them to do what we want, reorganizing files to make sure stuff exists in sensible places, and figuring out what to do when changing one piece affects three other pieces in unexpected ways.

And this means, someday, people will understand that, like any other kind of work, software development takes resources (namely, time!), not a magic wand. And no amount of “ambition” (read: wishful thinking) can really change that basic equation. You can pretend magic exists, but that doesn’t make it so. You aren’t fooling anyone. You just look childish.

When software development is recognized as work, there can be clarity about what is possible with a given set of resources. Then tasks can be sanely identified, specified, prioritized, coordinated, scheduled, executed, completed.

And then some really cool things can happen. Not magical things, but really cool things. Great things, even. The kind of great things that result from understanding, dedication, and hard work.