How do you debug your spark-dotnet app in visual studio?

When you run an application using spark-dotnet, you launch it with spark-submit, which starts a Java virtual machine, which starts the spark-dotnet driver, which then runs your program. That leaves us with a problem: how do we write our programs in Visual Studio and press F5 to debug? There are two approaches; one I have used for years with dotnet when I want to debug something that is challenging to get a debugger attached to - think apps which spawn other processes and fail in their startup routine.
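The post goes on to describe the approaches in detail; purely as a rough sketch of the classic "wait for a debugger" trick (the 500ms poll interval and the message are my own assumptions, not code from the post), the start of the driver program could look like this:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class Program
    {
        static void Main(string[] args)
        {
            // Pause here so Visual Studio can be attached to the dotnet process
            // that spark-submit has just started (Debug > Attach to Process...).
            Console.WriteLine($"Waiting for a debugger, process id: {Process.GetCurrentProcess().Id}");
            while (!Debugger.IsAttached)
            {
                Thread.Sleep(500);
            }
            Debugger.Break();

            // ...the rest of the spark-dotnet program carries on as normal...
        }
    }

Once a debugger is attached, the loop exits and you can step through the rest of the program as if you had pressed F5.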

How do we prove our ETL processes are correct? How do we make sure upstream changes don't break our processes and break our beautiful data?

ETL Testing Part 2 - Operational Data Testing. This is the second part of a series on ETL testing; the first part covered unit testing, and in this part we will talk about how we can prove the correctness of the actual data, both today and in the future, after every ETL run. Testing ETL processes is a multi-layered beast: we need to understand the different types of test, what they do for us, and how to actually implement them.
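To give a flavour of what an operational data test can look like (this is my own minimal sketch rather than code from the post, and the file paths and formats are hypothetical), a check that runs after every ETL load might be as simple as reconciling row counts between source and target:

    using System;
    using Microsoft.Spark.Sql;

    class RowCountReconciliation
    {
        static void Main()
        {
            var spark = SparkSession.Builder().GetOrCreate();

            // Hypothetical locations - substitute the real source and target of your ETL run.
            long sourceCount = spark.Read().Option("header", true).Csv("/data/source/orders.csv").Count();
            long targetCount = spark.Read().Parquet("/data/warehouse/orders").Count();

            if (sourceCount != targetCount)
            {
                throw new Exception(
                    $"Operational test failed: source has {sourceCount} rows but target has {targetCount}");
            }
        }
    }

A check like this doesn't prove every value is correct, but run after each load it catches the class of failure where an upstream change silently drops or duplicates rows.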

spark-dotnet how to manually break a file into rows and columns

I found a question on Stack Overflow that went something like this: “I have a file that includes line endings in the wrong place and I need to parse the text manually into rows” (https://stackoverflow.com/questions/57294619/read-a-textfile-of-fixed-length-with-newline-as-one-of-attribute-value-into-a-ja/57317527). I thought it would be interesting to implement this with what we have available today in spark-dotnet. The thing is, though, that even though this is possible in spark-dotnet (or the other versions of Spark), I would pre-process the file in something else, so that by the time Spark reads the file it is already in a suitable format.
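As a sketch of that pre-processing approach (the 20-character record width and the file names are hypothetical, just to illustrate the idea), you could strip the stray line endings and re-chunk the text in plain dotnet before Spark ever sees the file:

    using System.IO;
    using System.Linq;

    class Preprocess
    {
        static void Main()
        {
            const int recordLength = 20; // hypothetical fixed record width

            // Remove the line endings that landed in the wrong places.
            var text = File.ReadAllText("people_raw.txt")
                           .Replace("\r", "")
                           .Replace("\n", "");

            // Cut the remaining text into fixed-length records.
            var records = Enumerable
                .Range(0, text.Length / recordLength)
                .Select(i => text.Substring(i * recordLength, recordLength));

            // One clean record per line - Spark can now read it as an ordinary text file.
            File.WriteAllLines("people_clean.txt", records);
        }
    }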

java.lang.ClassNotFoundException: org.apache.spark.deploy.DotnetRunner with 0.4.0 of spark-dotnet

There was a breaking change in version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark. To fix the issue you need to use the new class name, which has an extra dotnet package segment near the end. Change:

    spark-submit --class org.apache.spark.deploy.DotnetRunner

into:

    spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner

What if I have this error but that doesn’t fix it? When you run a spark app using spark-submit and you get a ClassNotFoundException for the driver, it boils down to either a typo or something on your system blocking the jar from being loaded (anti-virus?

How do we test ETL pipelines? Part one unit tests

Why do we bother testing? Testing isn’t an easy thing to define. We all know we should do it; when something goes wrong in production, people shout and ask where the tests were - hell, even auditors like to see evidence of tests (whether or not the tests are any good isn’t generally part of an audit). What do we test, and how and why do we even write tests? It is all well and good saying “write unit tests and integration tests”, but what do we test?

spark-dotnet examples - reading and writing csv files

How do you read and write CSV files using the dotnet driver for Apache Spark? I have a runnable example here: https://github.com/GoEddie/dotnet-spark-examples Specifically: https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv Let’s take a walkthrough of the demo:

    Console.WriteLine("Hello Spark!");

    var spark = SparkSession
        .Builder()
        .GetOrCreate();

We start with the obligatory “Hello World!”, then we create a new SparkSession.

    //Read a single CSV file
    var source = spark
        .Read()
        .Option("header", true)
        .Option("inferSchema", true)
        .Option("ignoreLeadingWhiteSpace", true)
        .Option("ignoreTrailingWhiteSpace", true)
        .
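The excerpt is cut off before the read finishes; continuing from the spark session created above, and purely as an illustrative sketch (the paths and the rest of the option chain are my guesses, not the repo’s code), completing the read and adding the write side would look something like this:

    // Complete the read into a DataFrame (the path is a placeholder).
    var source = spark
        .Read()
        .Option("header", true)
        .Option("inferSchema", true)
        .Csv("./source.csv");

    source.Show();

    // Write the DataFrame back out as CSV with a header row.
    source
        .Write()
        .Mode("overwrite")
        .Option("header", true)
        .Csv("./output-csv");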

spark-dotnet how does user .net code run spark code in a java vm

Apache Spark is written in Scala; Scala compiles to Java bytecode and runs inside a Java virtual machine. The spark-dotnet driver runs dotnet code that calls Spark functionality, so how does that work? There are two paths to running dotnet code with Spark: the first is the general case, which I will describe here; the second is UDFs, which I will explain in a later post as it is slightly more involved.
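To make the general case concrete, here is a minimal sketch of user dotnet code; every call below is a thin proxy that the spark-dotnet driver forwards to the JVM, where the real Spark objects live (the comments reflect my understanding of the mechanism rather than text from the post):

    using Microsoft.Spark.Sql;

    class Program
    {
        static void Main()
        {
            // Creating the SparkSession asks the JVM side (started by DotnetRunner
            // via spark-submit) to create the real Scala SparkSession; the C# object
            // only holds a reference to it.
            var spark = SparkSession.Builder().AppName("proxy-demo").GetOrCreate();

            // Each method call is serialised over to the JVM, executed there, and the
            // resulting JVM object reference is handed back to dotnet.
            var df = spark.Range(0, 10);
            df.Show();
        }
    }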

Spark and dotnet in a single docker container

I really like the new dotnet driver for Spark because I think it makes Spark more accessible to devs who might not know python or scala. If you want to be able to build and run a dotnet application using the dotnet driver locally, you will need: the JRE, Spark, and the dotnet sdk. Now, the JRE has some docker images and so does the dotnet sdk, but what if you want both in a single container?

Where do you do your database development? Hopefully not production

Here are three scenarios. If you work with SQL Server, either as a provider of database environments (DBA) or as a consumer of database environments for your application (developer), then you will likely see yourself in one of these descriptions. If you don’t, please 100% find some way to tell me (email, comment below, etc.). Prod FTW’ers - There is only one place where the developers can develop. DBAs (if you have one) complain every so often about someone using “sa” on production, and whoever is using the “sa” account keeps leaving open transactions in SSMS, blocking all the users.

As a DBA, how do I offload some of my work?

I don’t have time for this: “I am a DBA, I am busy, too busy. Developers keep pushing changes to production without me reviewing the code, and now I am stuck again over the weekend fixing performance issues while the developers are on the beach with a piña colada and a cigar.” Sound familiar? Maybe the developers aren’t on the beach drinking and smoking, but the sentiment is the same: