If you write code for SQL Server, then you might be interested in this: I have written a tSQLt TDD training course which has helped over 500 people learn both tSQLt and how to apply TDD practices to their SQL Server T-SQL development. You can join the course at https://courses.agilesql.club.
ETL Testing Part 2 - Operational Data Testing
This is the second part of a series on ETL testing. The first part covered unit testing; in this part, we will talk about how we can prove the correctness of the actual data, both today and in the future after every ETL run.
Testing ETL processes is a multi-layered beast: we need to understand the different types of tests, what they do for us, and how to actually implement them. The tests we are talking about here are data tests, or operational testing; we are going to look at the real data that was produced by an ETL run and see whether it is as we expect it to be.
There are three ways we can implement tests for our data: constraints, reconciliation checks, and auditing.
If our source or target is a database that supports constraints, then these are a handy way to ensure that the data we have matches our expectations. If we expect a column never to be null, then we can use a non-null constraint. If we expect a column or set of columns to be unique, we can enforce a unique key on the column or columns.
We can use foreign keys to ensure that only known values are loaded. Some systems also allow us to use check constraints to validate the data further; for example, a check constraint can specify that a value falls within a set of known values or has some other property.
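As a sketch of how these constraints behave together, here is an in-memory SQLite database (used purely for illustration; the table and column names are invented) where a bad load fails loudly instead of silently persisting wrong data:

```python
import sqlite3

# In-memory SQLite purely for illustration; table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

conn.execute("CREATE TABLE categories (category_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE sales (
        sale_id     INTEGER NOT NULL UNIQUE,                     -- non-null + unique key
        category_id INTEGER REFERENCES categories(category_id),  -- only known values
        amount      REAL CHECK (amount >= 0)                     -- further validation
    )""")

conn.execute("INSERT INTO categories VALUES (1)")
conn.execute("INSERT INTO sales VALUES (100, 1, 9.99)")  # a valid row loads fine

# Each of these violates a different constraint and is rejected at load time.
for bad_row in [(100, 1, 5.00),    # duplicate sale_id -> unique key
                (101, 99, 5.00),   # unknown category  -> foreign key
                (102, 1, -5.00)]:  # negative amount   -> check constraint
    try:
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", bad_row)
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)
```

The same DDL in SQL Server or PostgreSQL behaves equivalently; the point is that the engine, not your code, polices the rules on every load.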
Constraints are excellent where they are available. However, they can be slow in some cases, so the cost of enforcing them may be prohibitive. Personally, I would try to use constraints where available and remove them only if the load became too slow. Knowing for sure that your data is persisted in a way that guarantees integrity is a compelling motivator for absorbing the performance overhead.
Today, data is not always persisted in a relational database with first-class constraint support, or even in an RDBMS at all, so it may be that we can't use constraints, or the implementation is not ideal and we choose not to use them.
Where we don't have a good implementation of constraints that we can use, we can still define our own constraints in code. A uniqueness check can be expressed in SQL such as:
select column1, column2, count(*)
from table
group by column1, column2
having count(*) > 1
Or in Python:
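A standard-library sketch of the same check (the rows and column values here are illustrative):

```python
from collections import Counter

# Rows as (column1, column2) tuples -- illustrative data.
rows = [("a", 1), ("a", 1), ("b", 2)]

# Equivalent of GROUP BY column1, column2 HAVING COUNT(*) > 1.
duplicates = [key for key, count in Counter(rows).items() if count > 1]
print(duplicates)  # [('a', 1)]
```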
Implementing constraints yourself can be quite expensive, so you need to consider what and how you test. I have always been a firm believer that if something is important and can affect the decisions a business makes, then it is worth testing. I am less likely to implement constraints for the lunch-time ping-pong rota than for a medical system that determines whether someone needs an extra (potentially fatal) dose of a medicine.
Finally, constraints are beneficial for ETL processes because upstream systems often change without the poor ETL developers getting any notice, and then a couple of things can happen. Firstly, your data can be wrong but silently succeed, so you believe you have useful data when in reality it is wrong. Secondly, you might get a failure at some later point that was caused by data earlier in the process being incorrect.
If you apply constraints when you bring data in from your source systems, you can catch changes that fundamentally alter the way the data behaves, and finding incorrect data early in the process also helps you to troubleshoot why ETL processes are broken.
Reconciliation is the process of going to your source system, getting a number, and validating that number on the target. This ranges from easy to impossible, so you need to decide what to reconcile on a case-by-case basis.
In its simplest form, we go to a source system and find out things like how many records are to be copied, sum up totals, and run other aggregations that we can then validate as correct (or not!) on the target system.
If the source system is something like a web service, perhaps you will need to extract a batch of data and write it to a file before loading; this file is then probably the source of your reconciliation. Undoubtedly, if your source system is an RDBMS, or a web service that accepts queries, querying it directly is the most straightforward approach. Otherwise, you will need to do something like parse the file and gather the data you need.
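A minimal sketch of a reconciliation check, with two in-memory SQLite databases standing in for the source and target systems (all table, column, and function names here are invented): pull a row count and a sum from each side, and fail the load if they disagree:

```python
import sqlite3

# Two in-memory databases stand in for the real source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

for db in (source, target):
    db.execute("CREATE TABLE sales (amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?)", [(10.0,), (20.0,), (30.0,)])

def reconcile(src, tgt):
    """Compare row count and total between source and target; fail loudly."""
    src_count, src_sum = src.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
    tgt_count, tgt_sum = tgt.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
    if (src_count, src_sum) != (tgt_count, tgt_sum):
        raise RuntimeError(
            f"reconciliation failed: source={src_count}/{src_sum} "
            f"target={tgt_count}/{tgt_sum}")
    return src_count, src_sum

print(reconcile(source, target))  # (3, 60.0)
```

Running the check as the last step of the ETL process is what gives you the chance to fail the run before the business sees the numbers.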
What I like about reconciliation checks is that they really are a way to prove correctness in a way that things like unit tests can't. Knowing that a source system reported sales of X million pounds in a day and that the target system says the same number is critical and, typically, not too hard to implement. The confidence that comes from having equal values for critical metrics across systems is hard to beat. Having a reconciliation check before completing the ETL process gives you a chance to fail the process and find the error before the business does. If you have a serious mistake in a critical metric, it isn't normally long before the business notices, and if they keep finding errors, it undermines your whole ETL process.
Being proactive with data issues is key to building a reliable ETL pipeline that the business trusts.
Auditing is the process of saving what you did and when, and then using that data to see whether something has gone wrong.
What I mean by this is that when you do things like pull data from a source system, you save a count of how many records you pulled. You might also save some details of the data, such as "the data broke down into x million rows with category a, x million rows with category b, etc.". When you join data sources, you save how many rows were in each source and how many rows you ended up with. When you write data, you save how many rows you have written.
Once you have all this data (in a computer-readable format), you can start to do things like verifying that, because you read x rows, x rows have been updated or inserted into the target. You can also build things like trend lines to show that on a typical day you read x million rows, with x per cent being one specific category, and if the values fall out of range you can fire off alerts.
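One way to sketch that kind of range check with the standard library: take the daily row counts from recent runs, and alert if today's count falls more than a few standard deviations from the mean (the figures and the three-sigma threshold here are made up for illustration):

```python
from statistics import mean, stdev

# Daily row counts from previous runs -- illustrative figures.
history = [1_000_000, 1_020_000, 980_000, 1_010_000, 995_000]
today = 400_000

avg, sd = mean(history), stdev(history)
threshold = 3 * sd  # arbitrary choice; tune to your pipeline's variability
out_of_range = abs(today - avg) > threshold
if out_of_range:
    print(f"ALERT: {today} rows is outside {avg:.0f} +/- {threshold:.0f}")
```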
Even without alerts, you can chart how much data you are processing and how long the process takes, to get some idea of whether you need to start planning for future growth.
The key with auditing is that, instead of adding lines like these to a log file:
Read 53,435 rows from source system blah
Writing data to target system "blah", updated 53,435 rows
you write the data in a way that is easy to parse, ideally in a format where the counts and descriptions are separate fields (think JSON, CSV, or an RDBMS table).
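For instance, the same information written as one JSON record per step stays trivially machine-readable (the file name and field names here are invented):

```python
import json
from datetime import datetime, timezone

def audit(step, **details):
    """Append one machine-readable audit record per ETL step."""
    record = {"ts": datetime.now(timezone.utc).isoformat(),
              "step": step, **details}
    with open("etl_audit.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

audit("extract", system="blah", rows_read=53_435)
audit("load", system="blah", rows_updated=53_435)
```

Because each line is a complete JSON document, the trend lines and alerts described above can be built by simply parsing the file back, rather than by scraping free-text log messages.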
Hopefully, that has helped explain how we can start thinking about testing our ETL pipelines and data in a way that proves correctness.
Unit testing is great for helping us write code fast and for documenting the way the system works (or at least what the developer thinks it should be doing), but poor for proving whether values are correct. This sort of operational testing is at the other end of the scale: it doesn't help us write code faster (if anything it makes us slower), but it does prove correctness. In the next post in this series, we will look at integration testing, which sits slap bang in the middle of these types of testing; it speeds us up a bit and proves correctness a bit - exciting!
If you do any sort of operational testing that isn't listed here, then I would love to hear what you do, either in the comments below or by email.