How do test the upstream data is good in an etl process (etl testing part 3)?
In the first two parts of this series (https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/ and https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/), I talked about how to unit test your business logic and integration test your ETL infrastructure code. Having these tests ensures that your code is in order, it means you have documented, and future-proofed your code which is a fantastic thing to have. What testing our code doesn't give us is a way to validate the data we receive is correct.
Code has been validating its own parameters since the dawn of bugs, but an ETL process is like a million parameters per second that all need validating because upstream changes can and will affect you at some point.
It isn't a case of maybe upstream changes will happen, it is a case of when. I am generally not a massive defensive programming fan, sure I am not reckless with my inputs, but I err on the side of other systems doing what they say they will, catching any errors and failing if anything goes wrong. With ETL testing, I know upstream data will be incorrect at some time, so include checks for that as part of the pipeline.
What do you need to test?
So this really depends on your data, each pipeline is different but what I like to do is to break the ETL process into a set of rules. You will have one or more input sources, and each will have one or more attribute and one or more records. You will also have one or more destination, which will have one or more attribute and one or more record. That is the only crucial thing for an ETL (or ELT whatever) process is how and where does your data come in and then, how and where does the data go out. The middle bit matters in that we might do some transformation but ultimately what do we get in, and what do we get out?
If we start to think of the process as a black box, then we can begin to think about what data we will receive, what is valid data and what is invalid data, for each data source, each attribute, each record. There isn't an easy way to bypass this, if you don't think about what data you are going to get in, you won't be able to work out what tests you should have.
A quick aside to answer the statement “yes, but, we don't know what data we will get in, we just need to process it and show it to users.”
Ok, this is a pretty standard response, you are not the first person to be in this boat and what you have isn't something that you can build a reliable set of tests against and so lets hope your users don't use your data to do something dangerous like financial reporting or prescribing medicine on the basis of this data.
It is ok to have data that you have no control of, and you just need to get in front of users but be clear with them that the data you receive might be incomplete or completely wrong - please don't give users a sense that the data is accurate because it probably isn’t.
If you genuinely cannot sit down and say we expect this data source to have these attributes which can fall into these ranges then you can't expect to be able to load the data with any certainty, fix the root cause of the issue and add tests or don't fix the root cause but be very clear about the accuracy of the data.
Back to it - writing the rules!
So we have one or more datasets we need to read, we have a list of attributes, and we have a list of possible values for each attribute. You probably won't be given this level of detail, you will likely have to study the data and figure it out yourself and come up with your own rules. Sometimes your own rules will be wrong, but that is fine if you have a failure in production and the rule you created was incorrect you can figure out what the rule should be and change it.
The goal of thinking about the rules should be to come up with a list of these answers for each attribute:
- Can it be null?
- What data type is it?
- What can the min/max it can be?
- Is there a known range the data can be in?
- If an attribute is one particular value, does that restrict what other attributes can be?
This is a guide, only you will know what makes sense for your data, you will need to look at the data and any documentation you have and figure out the best approach.
So armed with the knowledge of what data you expect to receive, you can start formalising that in your ETL pipeline, how exciting.
Writing the actual rules
There are a couple of approaches to writing these data quality rules, the first is to validate the data in its source, which depends on what the source is. If the source is an RDBMS maybe you can get these rules added as constraints to the database if you have control over the source system. If you don't have control over the source, then you could run SQL queries over the data to make sure it matched your expectations. Maybe you can bring the data into a table in your database without any constraints, run a set of queries for each rule there and report any errors at that stage.
If the data isn't in an RDBMS, maybe a flat file or a web service call which you save to a JSON file, you still have options, for one you can use something like Apache Drill to run SQL queries against the data or you can bring the data into a loading table without any constraints and run your queries there.
The key about the first method is being able to run queries against your data to validate that your rules are met:
select count(*) from data where non_null_column is null; select count(*) from data where try_cast(int_column) is not null;
That is it really, make it so it is easy to add and remove tests and it is easy to see what has gone wrong when something has gone wrong - do things to help such as include row numbers where you get a failure or include the actual value that failed when there is a failure.
You will likely want to write a framework to deal with the tests but make lightweight - this is really important, any framework for this should be in the order of hundreds of lines of code rather than thousands or god forbid hundreds of thousands (I've seen it aggghhhhhh!). Keep it simple, find some way to run your queries, save the output and use that to monitor the quality of your incoming data. If you really are struggling to be able to write a framework to run tests contact me and I'll help you, either just some advice or engage me as a consultant, I'll get you up and running.
The second approach is to use the constraints that RDBMS's have, things like a non-null constraint or check constraints. I am a firm believer that these have a place in every database, but I generally load data into tables without any formal constraints and then load the data into tables that do have constraints. The reason I don't have constraints on my loading tables isn't because of performance, it is because sometimes the data doesn't meet those constraints and typically the entire batch (or subset of the batch) will be thrown out, and it is hard to know what has failed.
The only way the second approach can really work is if you are loading records one at a time which is unlikely in an ETL process.
There you go, the third way to test your ETL processes is to identify the critical rules for your data and then write queries to ensure that the data does match your rules.
You will find three things, firstly you miss some rules which you can add in when they occur, secondly you have some rules that catch problems that are false positives, you can loosen those rules as they occur and thirdly you have a much better grasp of when there is a problem and be able to diagnose it significantly faster than when you know (or didn't know) that there was a problem with the upstream data.