TF-IDF in .NET for Apache Spark Using Spark ML

Last Updated: 2020-10-18 NOTE: What you read here was before .NET for Apache Spark 1.0 which includes everything we need to do this purely in .NET - in this post you will see an example that is no longer necessary for TF-IDF, instead view: https://the.agilesql.club/2020/12/spark-dotnet-tf-idf. Spark ML in .NET for Apache Spark Spark is awesome, .NET is awesome, machine learning (ML) is awesome, so what could be better than using .

Approaches to running Databricks ETL code from Azure ADF

Databricks is fantastic, but there is a small issue with how people use it. The problem is that Databricks is all things to all people. Data scientists and data analysts use Databricks to explore their data and write cool things. ML engineers use it to get their models to execute somewhere. Meanwhile, the cool kids (data engineers obviously) use it to run their ETL code. Some use cases favour instant access to “all the datas”, some favour rigorous engineering discipline so when we look at Databricks it is a case of one size does not fit all.

The four tenets of ETL testing

Every ETL pipeline is only ever as reliable as the data that the upstream system provides. It is inevitable that assumptions you make about the data you are provided will be shattered and there is absolutely nothing you can do about it. So what can we do? Do we just accept that our pipelines will break and fix them when the CEO shouts that the figures are out or even worse if no one notices and the data is wrong for months or years?

Passing status messages and results back from Databricks to ADF

When we use ADF to call Databricks we can pass parameters, nice. When we finish running the Databricks notebook we often want to return something back to ADF so ADF can do something with it. Think that Databricks might create a file with 100 rows in (actually big data 1,000 rows) and we then might want to move that file or write a log entry to say that 1,000 rows have been written.

Spark Delta Lake, Updates, Deletes and Time Travel

When you use delta lake there are a couple of interesting things to note based around the fact that the data is stored in parquet files which are read-only and delta lake includes the ability to delete and update data and view the state of a table at a specific point in time. Obviously read-only and updates and deletes don't exactly sound like they work together, so how does it all work and what do we need to be aware of?

Validating upstream data quality in ETL processes, SQL edition

It is a non-null constraint, not a non-ish-null constraint You are writing an ETL process, part of this process you need to import a semi-structured file (think CSV, JSON, XM-bleurgh-L, etc.) when you import the data into an RDBMS you get all sorts of things that make schema designers excited like unique constraints and check constraints. The problem is that the file you are importing is from another system and all “other” systems in the world make mistakes and changes and send you duff data that won't work with your lovely constraints.

SQLCover 0.5 - Fixes, smaller features and an exciting surprise

It has been a little while but I have updated SQLCover to include a number of fixes and small features, the majority of which are improvements to the html output: For full details and to download the latest version see: https://github.com/GoEddie/SQLCover/releases/tag/0.5.0 or https://www.nuget.org/packages/GOEddie.SQLCover/0.5.0 If you get any issues please comment below or raise an issue on github. Highlights Cobertura Cobertura is a format for code coverage tools, Azure DevOps supports cobertura files to display code coverage results alongside the build so this is a really nice thing to be able to have, if you use SQLCover in your Azure DevOps builds (or any ci server that supports Cobertura files) then you can use the Cobertura output to generate this:

How to test ETL Processes in production

This is the final part in the four-part series into testing ETL pipelines, how exciting! Part 1 - Unit Testing https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/ Part 2 - Integration Testing https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/ Part 3 - Validating the upstream data https://the.agilesql.club/2019/09/how-do-test-the-upstream-data-is-good-in-an-etl-process-etl-testing-part-3/ This final part is the last step, you have documented your business logic with unit tests, you have validated your pipeline with sample data (good and bad data), you have a step in your pipeline to ensure the upstream data meets your expectations and you have deployed the code to production where, AND ONLY where, you can be confident the code works.

How to connect spark to ms sql server without '[Error] [JvmBridge] java.sql.SQLException: No suitable driver'?

“[Error] [JvmBridge] java.sql.SQLException: No suitable driver” - unable to connect spark to Microsoft SQL Server. In spark when you want to connect to a database you use Read() passing in the format “jdbc” and include the options of url, driver and either dbtable or query. DataFrame dataFrame = spark.Read() .Format("jdbc") .Option("url","jdbc:sqlserver://localhost;databaseName=dName;") .Option("user", "user_name") .Option("password", "password or secret") .Option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") .Option("dbtable", "schema.table_name") .Load(); The url tells jdbc that we want to connect to sqlserver (jdbc:sqlserver) and then the details of the server to connect to.

How do test the upstream data is good in an etl process (etl testing part 3)?

In the first two parts of this series (https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/ and https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/), I talked about how to unit test your business logic and integration test your ETL infrastructure code. Having these tests ensures that your code is in order, it means you have documented, and future-proofed your code which is a fantastic thing to have. What testing our code doesn't give us is a way to validate the data we receive is correct.