How to test ETL processes in production

This is the final part in the four-part series on testing ETL pipelines - how exciting!

This final part is the last step: you have documented your business logic with unit tests, you have validated your pipeline with sample data (both good and bad), you have a step in your pipeline to ensure the upstream data meets your expectations, and you have deployed the code to production where, AND ONLY where, you can be confident the code works. Now that you have code in production you need to do the final type of testing, “testing in production”.

I call this “testing in production”: you are continuously testing that your code in production is working correctly. There is a massive distinction between that and the other type of production testing, “testing on production”. “Testing on production” is terrible - that is where you don’t have tests and you run your code in production to see if it works. Don’t do that.

Unit tests (part 1) check that one line of code, or one small set of lines, works - or at least does what the developer wanted it to do. These unit tests are good and useful because they show the developer’s intent and how well they understood the requirements, but they don’t prove whether the pipeline works in production.

Then we have integration tests (part 2), which start to show that specific data sources can be read, transformed and loaded. These are good because each test shows that a different part of the process works: maybe one source is read in, one column is transformed, and one specific join is, well, joined correctly.

We then have rules in our pipeline to check that the data we read meets the set of expectations we have about it (part 3). This means that when the pipeline runs and reads the data, we can be confident that the code we have written will be able to process it.

Now we have a certain level of confidence, but until we actually run the code in production we can never be 100% confident - even with the most rigorous testing there will always be something that can go wrong. What we need for full confidence is to see the pipeline running and to see its effects, run after run. It is fine to manually watch a pipeline once and declare it a success, but how do you know it works, and keeps working, from now until the day it is turned off?

There are two parts to the answer: the first is monitoring; the second is to roll out ETL pipelines slowly and side by side using phased deployments.

Monitoring

What we need to do is build in monitoring: capture metrics and report on them, create dashboards and look for trends that show when things have gone wrong, and have alerting for when things really go wrong.

What do we need to know?

At a minimum we need to know:

  • how many records you read
  • how many records you write
  • how long the process and ideally each step takes
  • how many failed records you have
  • summaries of essential business metrics that you process (total sales, for example)
  • any errors or exceptions - and ideally failed records should be stored somewhere so they can be dealt with later

How you capture these metrics depends on which tooling you are using for your ETL process. Unless you are writing your own framework there will be a way to capture this data; if there isn’t a built-in way to capture it, then write it to your own logging database or system.
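
As an illustration only, a pipeline run can record its own metrics at the end of every run. This is a minimal sketch using SQLite; the table, the column names and the shape of the stats returned by the pipeline are all hypothetical, not taken from any particular tool:

```python
import sqlite3
import time
from datetime import datetime, timezone

# Hypothetical run-metrics table - a sketch, not any specific tool's schema.
conn = sqlite3.connect("etl_metrics.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS pipeline_runs (
        pipeline        TEXT,
        started_at      TEXT,
        duration_secs   REAL,
        records_read    INTEGER,
        records_written INTEGER,
        records_failed  INTEGER,
        total_sales     REAL,
        error           TEXT
    )
    """
)

def run_pipeline_with_metrics(pipeline_name, run):
    """Run a pipeline callable and record the metrics listed above.

    `run` is assumed to return a dict with the counts and summary figures
    for the run, e.g. {"read": 0, "written": 0, "failed": 0, "total_sales": 0.0}.
    """
    started = datetime.now(timezone.utc).isoformat()
    t0 = time.monotonic()
    stats = {"read": 0, "written": 0, "failed": 0, "total_sales": 0.0}
    error = None
    try:
        stats = run()
    except Exception as exc:
        error = repr(exc)  # keep the exception text with the run record
        raise
    finally:
        conn.execute(
            "INSERT INTO pipeline_runs VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (
                pipeline_name,
                started,
                time.monotonic() - t0,
                stats["read"],
                stats["written"],
                stats["failed"],
                stats["total_sales"],
                error,
            ),
        )
        conn.commit()
```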

If you use something like Azure Log Analytics, Elasticsearch or Datadog for your application logging, then these can be good choices for your ETL pipeline logging too. I would always recommend using whatever systems your other developers already use - that way you can piggy-back on the work of others.

What do we do with the data?

You keep it, you keep it for a long time, and you use it to:

  • create alerts when things are wrong
  • create charts showing when things are trending towards going wrong (getting slower and slower, for example)
  • refer back to when something does go wrong, so you can see what is normal and what is not

If you have a list of previous runs, how long each took and how many records it processed, you can say things like “the process is still running, but when it had the same number of records last week it was much quicker, so there must be something else wrong”, or, more helpfully, “when it has this number of records it is expected to take x minutes, and we are x - 10 minutes into the process, so if it isn’t finished in 10 minutes there might be a problem”.
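
That kind of check can be automated. Here is a minimal sketch, assuming the hypothetical pipeline_runs table from the earlier example; the similarity window and the tolerance are arbitrary:

```python
import sqlite3
import statistics

def expected_duration(conn, pipeline, records):
    """Median duration of past successful runs with a similar record count."""
    rows = conn.execute(
        """
        SELECT duration_secs FROM pipeline_runs
        WHERE pipeline = ?
          AND records_read BETWEEN ? AND ?
          AND error IS NULL
        """,
        (pipeline, int(records * 0.9), int(records * 1.1)),
    ).fetchall()
    return statistics.median(r[0] for r in rows) if rows else None

def run_looks_slow(conn, pipeline, records, elapsed_secs, tolerance=1.5):
    """Flag a run that is already well past what similar runs usually take."""
    expected = expected_duration(conn, pipeline, records)
    return expected is not None and elapsed_secs > expected * tolerance
```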

There is nothing worse than trying to troubleshoot something without any data to tell you what is normal and what is abnormal - I have wasted a significant portion of my life because of a partial or total lack of historical data that would have helped with an issue.

Having monitoring and alerting, and being able to see whether your pipelines are running efficiently, is the only way to know, for sure, that your pipelines are working correctly in production. You can test from now until eternity, but until you are in production you can never be guaranteed that they will work.

Phased deployments

I first heard about these with Visual Studio Online (which became VSTS, which became Azure DevOps, which will become… TBD). What they did was have different rings: internal teams were on ring 0, friendly external teams on ring 1, less friendly external teams after that, and then everyone else. When they did a release they would release to ring 0, monitor and watch to see if anyone started crying; if no one started crying, on to ring 1, monitor and see if anyone started crying, and so on. This meant they could roll out changes and, if there was a problem, not carry on to the next ring and upset actual customers.

Now, doing this is quite complicated: you need separate, segregated environments and you need to split your customers (internal and external) into friendly and unfriendly segments.
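
To make the idea concrete, here is a rough sketch of a ring-by-ring rollout. Everything in it - the ring assignments, the deploy step and the “is anyone crying” check - is made up for illustration; in reality the rings would be separate environments or customer segments and the check would query your monitoring:

```python
import time

BAKE_TIME_SECS = 0  # in real life, hours or days of monitoring per ring

# Hypothetical ring assignments.
RINGS = {
    0: ["internal teams"],
    1: ["friendly external teams"],
    2: ["everyone else"],
}

def deploy(version, tenants):
    print(f"deploying {version} to {tenants}")

def anyone_crying(ring):
    """Placeholder for checking your monitoring and alerts after a release."""
    return False

def release(version):
    for ring, tenants in sorted(RINGS.items()):
        deploy(version, tenants)
        time.sleep(BAKE_TIME_SECS)  # let the release bake while you watch the dashboards
        if anyone_crying(ring):
            print(f"stopping the rollout of {version} at ring {ring}")
            return
```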

Now, if we have an ETL pipeline, we should think about how we can get our code into production in phases: how can we make changes to the code we have without breaking the old code, while seeing whether the new code works?

First, we need to address the fact that an ETL pipeline can be different things to different people; for example, an ETL pipeline could be any of these (and more):

  • CSV file to database
  • CSV file to Parquet
  • web service to Parquet
  • JSON to Excel

If we are writing to a database, maybe we can leave the old pipeline in place and write to a version 2 of the table, let users see the data, run the old and new versions side by side, and after a while turn off the old pipeline. That might work in some situations. Using views or synonyms is often a great way to flip different versions of things in for different people - RDBMSs have some cool features for modern DevOps even though they aren’t modern!
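
Here is a minimal sketch of the view-flipping idea, using SQLite purely so the example runs; the table and view names are hypothetical, and in an RDBMS that supports synonyms you could flip a synonym instead of a view:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_v1 (id INTEGER, amount REAL);
    CREATE TABLE sales_v2 (id INTEGER, amount REAL, currency TEXT);

    -- Consumers query the view "sales"; today it points at the old pipeline's table.
    CREATE VIEW sales AS SELECT id, amount FROM sales_v1;
    """
)

def switch_to_v2(conn):
    """Flip the view to the new pipeline's table once users have validated it."""
    conn.executescript(
        """
        DROP VIEW sales;
        CREATE VIEW sales AS SELECT id, amount FROM sales_v2;
        """
    )
```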

If you are writing to an output file, can you do something like writing to a version 2 of the file, so you can export the old and new files side by side until your users have validated that everything is okay?

So there you have it: roll out your code slowly, find a way to get it into production without affecting everyone, and find some friendly users who are eager for the latest and greatest and don’t mind when things don’t work perfectly. The good thing about running both processes side by side is that you can do a lovely bit of reconciliation between the two - a sketch of what that might look like is below.
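
This is a minimal sketch of reconciling the old and new exports, assuming both are CSV files with a hypothetical “amount” column; the file names are made up:

```python
import csv

def summarise(path):
    """Row count and total of the 'amount' column for one export."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return len(rows), sum(float(r["amount"]) for r in rows)

def reconcile(old_path, new_path):
    old_count, old_total = summarise(old_path)
    new_count, new_total = summarise(new_path)
    if old_count != new_count:
        print(f"row counts differ: {old_count} vs {new_count}")
    if abs(old_total - new_total) > 0.01:
        print(f"totals differ: {old_total} vs {new_total}")

# e.g. reconcile("sales_v1.csv", "sales_v2.csv")
```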

Done

I hope that this four-part series has given you a good enough overview of how to test your ETL processes in a way that means you can deploy changes more quickly and with fewer problems than ETL processes without tests. With a good suite of tests you have a level of confidence that means you can just move faster: you can take a request from the business and get it out there as soon as possible. There is no better way to develop than being able to check code in and have a set of tests tell you that you are good to go. Fearless development begins and ends with tests.