If you write code for SQL Server then you might be interested in this: I have written a tSQLt TDD training course which has helped over 500 people learn both tSQLt and how to apply TDD practices to their SQL Server T-SQL development. You can join the course at https://courses.agilesql.club.
Every ETL pipeline is only ever as reliable as the data that the upstream system provides. It is inevitable that the assumptions you make about the data you are given will be shattered, and there is absolutely nothing you can do to stop that happening.
So what can we do? Do we just accept that our pipelines will break and fix them when the CEO shouts that the figures are out, or even worse when no one notices and the data is wrong for months or years? No, of course we don't! What we do is use testing to help us.
There are two things we need to do. The first is to make sure that we have a good set of unit tests, integration tests and acceptance tests so that we can be certain our code is right, so that our tests document our code, and so that we can make changes to our code as quickly as possible. The second is to have monitoring and data quality checks so that we validate our assumptions AND, when something goes wrong, get an alert telling us that something is wrong, rather than a pipeline that silently succeeds.
Business Logic vs Infrastructure Code
When you look at testing ETL pipelines, you really have to think about the code: what it does and what you need to test. If you think of an ETL pipeline, the process goes something like this:
- Copy a file from a to b
- Read the file from b
- Join the file with c, d and f
- Write the joined data to ZZ
- Read from ZZ, aggregate in ZYX
- Write ZYX somewhere else
If we think about this, there are two types of code: the business logic that joins b with c, d and f, and the business logic that aggregates into ZYX. The rest of the code is infrastructure code:
- Copy a file
- Read a file
- Write a file
- Write somewhere else
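To make the split concrete, here is a minimal sketch in Python. All of the names (join_sources, aggregate_sales, run_pipeline) are invented for illustration: the point is that the business logic is written as pure functions over in-memory rows, while the infrastructure is a thin wrapper whose read/write callables belong to whatever ETL framework you use.

```python
# Sketch only: separating business logic (pure, unit-testable functions)
# from infrastructure code (framework-supplied reads and writes).

def join_sources(b_rows, c_rows):
    """Business logic: join rows from b with lookup rows from c by key.
    No file or database access, so it can be unit tested in isolation."""
    c_by_key = {row["key"]: row for row in c_rows}
    return [
        {**b_row, **c_by_key[b_row["key"]]}
        for b_row in b_rows
        if b_row["key"] in c_by_key
    ]

def aggregate_sales(rows):
    """Business logic: total the amount per region. Also a pure function."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals

def run_pipeline(read_b, read_c, write_zz):
    """Infrastructure: wiring only. The read/write callables are the
    framework's job and are validated by running them, not unit testing."""
    joined = join_sources(read_b(), read_c())
    write_zz(joined)
    return aggregate_sales(joined)
```

Because the joins and aggregations take plain rows in and hand plain rows back, the tests that follow never need to touch a file copy or a database connection.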
The business logic needs testing and validating; the infrastructure code just needs running to validate it. We are more than likely using an ETL system to do the actual work of the infrastructure code - we are just configuring a task within that ETL framework.
Unit Tests
The first type of test is the unit test. Unit tests validate that any business logic we have is correct - the key point here is that a unit test validates that one bit of business logic works in one specific way. A unit test doesn't check whether any infrastructure code works or whether an entire ETL pipeline works. For unit testing to be viable in an ETL pipeline, we need to separate the business logic from the infrastructure code. If we are using Airflow and Python then we can unit test Python functions; if we are using SSIS or ADF then we need to put our business logic in stored procedures (or maybe, at a push, in .NET libraries called from SSIS, but that is a separate world of pain).
A good set of unit tests will:
- Document how well the developer understood the requirements when they wrote the code
- Explain how any business logic works
- Give living examples of how the business logic should be called
- Validate that a feature of a function behaves in a known way, given a known state
A good set of unit tests will NOT:
- Give any confidence that the ETL pipeline will work in production
The key to unit testing is understanding that you have some business logic and you are running that logic with known inputs to see how it behaves. You know you are doing unit testing correctly when each test starts from zero rows and inserts just the handful of rows that that specific test needs.
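As a sketch of that style, here are unit tests for one small piece of business logic. The function apply_discount and its threshold are invented for illustration; the point is that the zero-row case is tested explicitly and every other test configures only the few rows it needs.

```python
# Hedged sketch: unit tests for a single pure business-logic function.
# apply_discount is an invented example, not from any real pipeline.

def apply_discount(orders, threshold=100, rate=0.1):
    """Discount any order at or above the threshold by the given rate."""
    return [
        {**o, "amount": o["amount"] * (1 - rate)}
        if o["amount"] >= threshold else o
        for o in orders
    ]

def test_zero_rows_gives_zero_rows():
    # The empty case: no rows in, no rows out.
    assert apply_discount([]) == []

def test_order_under_threshold_is_unchanged():
    # One row, configured just for this test.
    assert apply_discount([{"id": 1, "amount": 50}]) == [{"id": 1, "amount": 50}]

def test_order_at_threshold_is_discounted():
    # Boundary case: exactly at the threshold gets the discount.
    assert apply_discount([{"id": 2, "amount": 100}]) == [{"id": 2, "amount": 90.0}]
```

Each test names the one behaviour it checks, which is what lets the test suite double as documentation of the business rules.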
Integration Tests
This is probably the place to start if you have no tests already. An integration test:
- Deletes all the data in the source and target
- Inserts data into the source and maybe the target
- Runs the ETL tool or process
- Validates that the tool or process was a success
- Validates any logging to make sure there are no errors or failures, and that there are successes
- Validates the output data to ensure the correct data was written
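Those steps can be sketched end to end. In this sketch an in-memory SQLite database stands in for the real source and target, and a trivial INSERT ... SELECT stands in for whatever your actual ETL tool runs; the table names and log table are invented for illustration.

```python
import sqlite3

def run_integration_test():
    """Sketch of an integration test: clean, seed, run, check logs, check output."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE source (id INTEGER, amount REAL);
        CREATE TABLE target (id INTEGER, amount REAL);
        CREATE TABLE etl_log (message TEXT, is_error INTEGER);
    """)
    # 1. Delete all the data in the source and target (explicit, even when empty)
    conn.execute("DELETE FROM source")
    conn.execute("DELETE FROM target")
    # 2. Insert known data into the source
    conn.executemany("INSERT INTO source VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
    # 3. Run the ETL tool or process - a trivial copy stands in for the real tool
    conn.execute("INSERT INTO target SELECT id, amount FROM source")
    conn.execute("INSERT INTO etl_log VALUES ('copy complete', 0)")
    conn.commit()
    # 4./5. Validate the logging: successes present, no errors or failures
    errors = conn.execute("SELECT COUNT(*) FROM etl_log WHERE is_error = 1").fetchone()[0]
    assert errors == 0, "ETL logged errors"
    successes = conn.execute("SELECT COUNT(*) FROM etl_log WHERE is_error = 0").fetchone()[0]
    assert successes > 0, "ETL logged no successes"
    # 6. Validate the output data
    rows = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
    assert rows == [(1, 10.0), (2, 20.0)], f"unexpected target data: {rows}"
    return rows
```

In a real suite the "run the ETL process" step would invoke your tool (an Airflow task, an SSIS package, a stored procedure) against a dedicated test database rather than an in-memory copy.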
A good set of integration tests will:
- Document how one entity or file or specific process is processed by the ETL pipeline
- Prove that a single entity or process can be handled correctly by the ETL pipeline
- Show any exception data is handled correctly
- Start to give some confidence that the ETL pipeline will work
A good set of integration tests will NOT:
- Prove that an entire ETL process will work in production
- Connect to third party systems - we will create our own dummy versions
Acceptance Tests
This environment is likely a mirror of production or even a subset of production - think deploying to a set of friendly customers or dogfooding your own data. The full ETL pipelines run, probably connecting to third-party systems, the data that is written is like production data (if not production data), and the business users can see their dashboards running and be confident in the ETL process.
A good set of acceptance tests will:
- Give a high level of confidence that the ETL process will work in production - not 100%, as nothing but production is 100%, but depending on how much effort we put in, not far from it
- Show data flowing end-to-end with pretty dashboards at the end of the process
A good set of acceptance tests will NOT:
- Give 100% confidence of the ETL pipeline working in production
Monitoring and Data Quality Checks
When we are in production, when our ETL pipeline has completed and the results have been pushed to our central logging / monitoring solution, then and only then can we have 100% confidence that the pipeline is good.
When we do receive data from an upstream system that breaks our pipeline, we need to have checks like constraints (there are always ways to implement self-made constraints if we aren't writing to an RDBMS) to validate that our assumptions are correct, such as:
- Checking non-null columns for nulls
- Checking sales amounts tally across entities
- Checking the volume of sales is within a tolerance
- Checking data types
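A minimal sketch of such checks, assuming the batch arrives as plain rows; the column names (customer_id, sale_amount), the expected volume and the tolerance are all invented for illustration. The function returns a list of failed assumptions so the caller can raise an alert rather than silently succeed.

```python
# Hedged sketch of data-quality checks over a batch of incoming rows.
# Column names, volumes and tolerances are illustrative assumptions.

def check_batch(rows, expected_volume, tolerance=0.2):
    """Return a list of failed-assumption messages (empty means all good)."""
    problems = []
    # Non-null check on a column we assume is mandatory
    if any(row.get("customer_id") is None for row in rows):
        problems.append("customer_id contains NULLs")
    # Data-type check: sale_amount should be numeric
    if any(not isinstance(row.get("sale_amount"), (int, float)) for row in rows):
        problems.append("sale_amount has a non-numeric value")
    # Volume check: row count within tolerance of what we normally receive
    low = expected_volume * (1 - tolerance)
    high = expected_volume * (1 + tolerance)
    if not (low <= len(rows) <= high):
        problems.append(f"row count {len(rows)} outside tolerance of {expected_volume}")
    return problems
```

Wiring the returned messages into your alerting (email, Slack, the monitoring solution above) is what turns a silent failure into a fast fix.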
Add enough checks to your pipeline so that when - not if, but when - you get bad data, you can quickly identify the issue, quickly fix it, and get the fix deployed back out to production. You will never be faster at doing this than with a good set of automated tests.
For more on ETL testing, I have slides from talks and a four-part blog series:
If you need help with database testing or ETL testing, please feel free to reach out to me: firstname.lastname@example.org