Validating upstream data quality in ETL processes, SQL edition

It is a non-null constraint, not a non-ish-null constraint

You are writing an ETL process, part of this process you need to import a semi-structured file (think CSV, JSON, XM-bleurgh-L, etc.) when you import the data into an RDBMS you get all sorts of things that make schema designers excited like unique constraints and check constraints. The problem is that the file you are importing is from another system and all “other” systems in the world make mistakes and changes and send you duff data that won’t work with your lovely constraints. The problem we have is that RDBMS’s apply to entire column at one time. This columnar approach to constraints means that, although you might have nine hundred and ninety nine thousand rows that will pass the constraints, if you have even one, single solitary, row that fails then the whole constraint fails. It is a non-null constraint, not a non-ish-null constraint

The simplest way to deal with this is to import all the data into a table without any constraints, some people go as far as just using varchar or nvarchar for every column, something like this for this csv:

first_name, second_name, dob, class
ed, elliott, 01/01/1911, top
bert, bigglesworth,, bottom
englebert, mcmahon,21/12/2001,middle
ed, elliott, 01/01/1911, top

CREATE TABLE FILE_STAGING(
    FILE_IMPORT_ID INT IDENTITY(1,1) NOT NULL,
    FIRST_NAME NVARCHAR(MAX) NULL,
    SECOND_NAME NVARCHAR(MAX) NULL,
    DOB NVARCHAR(MAX) NULL,
    CLASS NVARCHAR(MAX) NULL,
)

Then the data is imported and we can start to see what sort of quality we have. There are a couple of approaches to get the data from this staging table into something that Edgar F. Codd would be happier with. The first is we can write a massive SQL statement that does everything in one step:

SELECT DISTINCT FIRST_NAME, SECOND_NAME, CAST(DOB AS DATE) AS DOB, CLASS 
    FROM FILE_STAGING
    WHERE 
        (FIRST_NAME IS NOT NULL AND SECOND_NAME IS NOT NULL AND DOB IS NOT NULL AND CLASS IS NOT NULL)
        AND
        (TRY_CONVERT(date, DOB) IS NOT NULL)
        AND
        (CLASS IN ('top', 'middle', 'bottom'))

We can select those rows which we want into our real table which will either already have constraints enabled or we can enable the constraints after we load the data. There are a few problems with this approach, the first problem is that this isn’t very easy to debug. When you realise that your load process only loaded half of the expected rows, why didn’t it load the rest? Other problems include, what do we do with any failed rows - i.e. which rows failed? To answer this we need another query that is the reverse of the “get good rows”. Finally, this is quite a lot of SQL for a file with four columns and three rules, what if the rules are more complex and our data file has lots of columns? I tell you what happens if the rules are more complex: “a mess of SQL jank”, that’s what.

So what instead? I’ll tell you what I like to do, I like to keep a running tally of each rule we need to apply to the data, with the data so anyone can get to it and we can use the awesomeness of SQL to pull the good data, troubleshoot why rows aren’t coming across and generally be smug with our own awesomeness:

CREATE TABLE FILE_STAGING(
    FILE_IMPORT_ID INT IDENTITY(1,1) NOT NULL,
    FIRST_NAME NVARCHAR(MAX) NULL,
    SECOND_NAME NVARCHAR(MAX) NULL,
    DOB NVARCHAR(MAX) NULL,
    CLASS NVARCHAR(MAX) NULL,
    FIRST_NAME_IS_NULL AS (CASE ISNULL(FIRST_NAME, N'ISNULL') WHEN N'ISNULL' THEN 1 ELSE 0 END),
    SECOND_NAME_IS_NULL AS (CASE ISNULL(SECOND_NAME, N'ISNULL') WHEN N'ISNULL' THEN 1 ELSE 0 END),
    DOB_IS_NULL AS (CASE ISNULL(DOB, N'ISNULL') WHEN N'ISNULL' THEN 1 ELSE 0 END),
    CLASS_IS_NULL AS (CASE ISNULL(CLASS, N'ISNULL') WHEN N'ISNULL' THEN 1 ELSE 0 END),
    DOB_IS_NOT_DATE AS (CASE ISNULL(TRY_CONVERT(DATE, DOB), '1900-01-01') WHEN '1900-01-01' THEN 1 ELSE 0 END),
    CLASS_IS_INVALID AS ( CASE CLASS WHEN 'top' THEN 0 WHEN 'middle' THEN 0 WHEN 'bottom' THEN 0 ELSE 1 END ),

    ANY_ERRORS INT NULL 
)

You see that “ANY_ERRORS” at the end? In MS SQL we can’t reference a computed column from another computed column so we need to do an update ourselves:

UPDATE FILE_STAGING
    SET ANY_ERRORS = 1
WHERE (FIRST_NAME_IS_NULL + SECOND_NAME_IS_NULL + DOB_IS_NULL + CLASS_IS_NULL + DOB_IS_NOT_DATE + CLASS_IS_INVALID) > 0

We could just include the where with our select but why make life harder for ourselves, saving a millisecond of processing time and making it harder for ourselves to debug it is a waste of time and money:

INSERT INTO DESTINATION_TABLE
SELECT * FROM FILE_STAGING 
    WHERE
    (FIRST_NAME_IS_NULL + SECOND_NAME_IS_NULL + DOB_IS_NULL + CLASS_IS_NULL + DOB_IS_NOT_DATE + CLASS_IS_INVALID) > 0

Ok, maybe you could create a view or something, i’m not your mum do what you want.

Now remember to be flexible, in some cases such as where we have a lot of possible values a column could be, we might want to have a separate lookup rather than a long ugly constraint:

UPDATE FILE_STAGING 
    SET CLASS_IS_INVALID = 1 
WHERE
    CLASS NOT IN (SELECT CLASS FROM VALID_CLASSES)

and use a default on the original table for that column:

CREATE TABLE FILE_STAGING(
    FILE_IMPORT_ID INT IDENTITY(1,1) NOT NULL,
    FIRST_NAME NVARCHAR(MAX) NULL,
    SECOND_NAME NVARCHAR(MAX) NULL,
    DOB NVARCHAR(MAX) NULL,
    CLASS NVARCHAR(MAX) NULL,
    FIRST_NAME_IS_NULL AS (CASE ISNULL(FIRST_NAME, N'ISNULL') WHEN N'ISNULL' THEN 1 ELSE 0 END),
    SECOND_NAME_IS_NULL AS (CASE ISNULL(SECOND_NAME, N'ISNULL') WHEN N'ISNULL' THEN 1 ELSE 0 END),
    DOB_IS_NULL AS (CASE ISNULL(DOB, N'ISNULL') WHEN N'ISNULL' THEN 1 ELSE 0 END),
    CLASS_IS_NULL AS (CASE ISNULL(CLASS, N'ISNULL') WHEN N'ISNULL' THEN 1 ELSE 0 END),
    DOB_IS_NOT_DATE AS (CASE ISNULL(TRY_CONVERT(DATE, DOB), '1900-01-01') WHEN '1900-01-01' THEN 1 ELSE 0 END),
    /*  Look Here VV     */
    CLASS_IS_INVALID INT NOT NULL DEFAULT (0),

    ANY_ERRORS INT NULL 
)

Whether we default to the class being valid or invalid and then update valid or invalid depends on whether we normally expect them to be valid not not, generally try to update as few rows as possible but unless you are dealing with unreasonably large amounts of data, it probably doesn’t matter.

There you go, when writing data import code from semi-structured into something more structured here are some approaches to make the data a little bit easier to debug, hope it helps and would love to hear what you do!