Every ETL pipeline is only ever as reliable as the data that the upstream system provides. It is inevitable that assumptions you make about the data you are provided will be shattered and there is absolutely nothing you can do about it. So what can we do? Do we just accept that our pipelines will break and fix them when the CEO shouts that the figures are out or even worse if no one notices and the data is wrong for months or years?
When we use ADF to call Databricks we can pass parameters, nice. When we finish running the Databricks notebook we often want to return something back to ADF so ADF can do something with it. Think that Databricks might create a file with 100 rows in (actually big data 1,000 rows) and we then might want to move that file or write a log entry to say that 1,000 rows have been written.
When you use delta lake there are a couple of interesting things to note based around the fact that the data is stored in parquet files which are read-only and delta lake includes the ability to delete and update data and view the state of a table at a specific point in time. Obviously read-only and updates and deletes don’t exactly sound like they work together, so how does it all work and what do we need to be aware of?
It is a non-null constraint, not a non-ish-null constraint You are writing an ETL process, part of this process you need to import a semi-structured file (think CSV, JSON, XM-bleurgh-L, etc.) when you import the data into an RDBMS you get all sorts of things that make schema designers excited like unique constraints and check constraints. The problem is that the file you are importing is from another system and all “other” systems in the world make mistakes and changes and send you duff data that won’t work with your lovely constraints.
It has been a little while but I have updated SQLCover to include a number of fixes and small features, the majority of which are improvements to the html output: For full details and to download the latest version see: https://github.com/GoEddie/SQLCover/releases/tag/0.5.0 or https://www.nuget.org/packages/GOEddie.SQLCover/0.5.0 If you get any issues please comment below or raise an issue on github. Highlights Cobertura Cobertura is a format for code coverage tools, Azure DevOps supports cobertura files to display code coverage results alongside the build so this is a really nice thing to be able to have, if you use SQLCover in your Azure DevOps builds (or any ci server that supports Cobertura files) then you can use the Cobertura output to generate this:
This is the final part in the four-part series into testing ETL pipelines, how exciting! Part 1 - Unit Testing https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/ Part 2 - Integration Testing https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/ Part 3 - Validating the upstream data https://the.agilesql.club/2019/09/how-do-test-the-upstream-data-is-good-in-an-etl-process-etl-testing-part-3/ This final part is the last step, you have documented your business logic with unit tests, you have validated your pipeline with sample data (good and bad data), you have a step in your pipeline to ensure the upstream data meets your expectations and you have deployed the code to production where, AND ONLY where, you can be confident the code works.
“[Error] [JvmBridge] java.sql.SQLException: No suitable driver” - unable to connect spark to Microsoft SQL Server. In spark when you want to connect to a database you use Read() passing in the format “jdbc” and include the options of url, driver and either dbtable or query. DataFrame dataFrame = spark.Read() .Format("jdbc") .Option("url","jdbc:sqlserver://localhost;databaseName=dName;") .Option("user", "user_name") .Option("password", "password or secret") .Option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") .Option("dbtable", "schema.table_name") .Load(); The url tells jdbc that we want to connect to sqlserver (jdbc:sqlserver) and then the details of the server to connect to.
In the first two parts of this series (https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/ and https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/), I talked about how to unit test your business logic and integration test your ETL infrastructure code. Having these tests ensures that your code is in order, it means you have documented, and future-proofed your code which is a fantastic thing to have. What testing our code doesn’t give us is a way to validate the data we receive is correct.
I finally got around to updating the tSQLt test adapter for visual studio, you can download it from: https://marketplace.visualstudio.com/items?itemName=vs-publisher-263684.GoEddietSQLt2019 or the search in visual studio extensions thingy finds it as well. For details on what this is and how it works see the original post: https://the.agilesql.club/2016/08/tsqlt-visual-studio-test-adapter/
Steps needed Getting Apache Spark running on windows involves: Installing a JRE 8 (Java 1.8/OpenJDK 8) Downloading and extracting SPARK and setting SPARK_HOME Downloading winutils.exe and setting HADOOP_HOME If using the dotnet driver also downloading the Microsoft.Spark.Worker and setting DOTNET_WORKER_DIR if you are going to use UDF’s Making sure java and %SPARK_HOME%\bin are on your path There are some pretty common mistakes people make (myself included!), most common I have seen recently have been having a semi-colon in JAVA_HOME/SPARK_HOME/HADOOP_HOME or having HADOOP_HOME not point to a directory with a bin folder which contains winutils.