TF-IDF in .NET for Apache Spark Using Spark ML v2

Spark ML in .NET for Apache Spark
Apache Spark has had a machine learning API for quite some time, and it has been partially implemented in .NET for Apache Spark. In this post we will look at how we can use the Apache Spark ML API from .NET. This is the second version of this post; the first was written before version 1 of .NET for Apache Spark, when a vital piece of the implementation was missing, which meant that although we could build the model in .

.NET for Apache Spark UDFs Missing Shared State

The Problem
When you use a UDF in .NET for Apache Spark, something like this code:

    using System;
    using System.Text;
    using Microsoft.Spark.Sql;

    class Program
    {
        static void Main(string[] args)
        {
            var spark = SparkSession.Builder().GetOrCreate();

            _logging.AppendLine("Starting Select");

            // Wrap the static method as a UDF and apply it to every row
            var udf = Functions.Udf<int, string>(theUdf);
            spark.Range(100).Select(udf(Functions.Col("id"))).Show();

            _logging.AppendLine("Ending Select");
            Console.WriteLine(_logging.ToString());
        }

        // Static field that both Main and the UDF append to
        private static readonly StringBuilder _logging = new StringBuilder();

        private static string theUdf(int val)
        {
            _logging.AppendLine($"udf passed: {val}");
            return $"udf passed {val}";
        }
    }

Generally, knowing .NET, we would expect the following output:

TF-IDF in .NET for Apache Spark Using Spark ML

Last Updated: 2020-10-18
NOTE: This post was written before .NET for Apache Spark 1.0, which includes everything we need to do this purely in .NET. The example in this post is no longer necessary for TF-IDF; instead view: https://the.agilesql.club/2020/12/spark-dotnet-tf-idf
Spark ML in .NET for Apache Spark
Spark is awesome, .NET is awesome, machine learning (ML) is awesome, so what could be better than using .

Approaches to running Databricks ETL code from Azure ADF

Databricks is fantastic, but there is a small issue with how people use it. The problem is that Databricks is all things to all people. Data scientists and data analysts use Databricks to explore their data and write cool things. ML engineers use it to get their models to execute somewhere. Meanwhile, the cool kids (data engineers obviously) use it to run their ETL code. Some use cases favour instant access to “all the datas”, others favour rigorous engineering discipline, so when we look at Databricks it is a case of one size does not fit all.

The four tenets of ETL testing

Every ETL pipeline is only ever as reliable as the data that the upstream system provides. It is inevitable that assumptions you make about the data you are provided will be shattered, and there is absolutely nothing you can do about it. So what can we do? Do we just accept that our pipelines will break and fix them when the CEO shouts that the figures are out or, even worse, when no one notices and the data is wrong for months or years?

Passing status messages and results back from Databricks to ADF

When we use ADF to call Databricks we can pass parameters, which is nice. When the Databricks notebook finishes running, we often want to return something to ADF so that ADF can do something with it. Suppose Databricks creates a file with 100 rows in it (actually, this is big data, so 1,000 rows); we might then want to move that file or write a log entry to say that 1,000 rows have been written.

Spark Delta Lake, Updates, Deletes and Time Travel

When you use Delta Lake there are a couple of interesting things to note. The data is stored in Parquet files, which are read-only, yet Delta Lake includes the ability to delete and update data and to view the state of a table at a specific point in time. Obviously read-only files and updates and deletes don’t exactly sound like they work together, so how does it all work and what do we need to be aware of?

Validating upstream data quality in ETL processes, SQL edition

It is a non-null constraint, not a non-ish-null constraint
You are writing an ETL process, and as part of this process you need to import a semi-structured file (think CSV, JSON, XM-bleurgh-L, etc.). When you import the data into an RDBMS you get all sorts of things that make schema designers excited, like unique constraints and check constraints. The problem is that the file you are importing is from another system, and all “other” systems in the world make mistakes and changes and send you duff data that won’t work with your lovely constraints.

SQLCover 0.5 - Fixes, smaller features and an exciting surprise

It has been a little while, but I have updated SQLCover to include a number of fixes and small features, the majority of which are improvements to the HTML output. For full details and to download the latest version see: https://github.com/GoEddie/SQLCover/releases/tag/0.5.0 or https://www.nuget.org/packages/GOEddie.SQLCover/0.5.0
If you get any issues please comment below or raise an issue on GitHub.
Highlights
Cobertura
Cobertura is a format for code coverage tools. Azure DevOps supports Cobertura files to display code coverage results alongside the build, so this is a really nice thing to be able to have. If you use SQLCover in your Azure DevOps builds (or any CI server that supports Cobertura files) then you can use the Cobertura output to generate this:

How to test ETL Processes in production

This is the final part in the four-part series on testing ETL pipelines, how exciting!
Part 1 - Unit Testing https://the.agilesql.club/2019/07/how-do-we-test-etl-pipelines-part-one-unit-tests/
Part 2 - Integration Testing https://the.agilesql.club/2019/08/how-do-we-prove-our-etl-processes-are-correct-how-do-we-make-sure-upstream-changes-dont-break-our-processes-and-break-our-beautiful-data/
Part 3 - Validating the upstream data https://the.agilesql.club/2019/09/how-do-test-the-upstream-data-is-good-in-an-etl-process-etl-testing-part-3/
This final part is the last step: you have documented your business logic with unit tests, you have validated your pipeline with sample data (good and bad data), you have a step in your pipeline to ensure the upstream data meets your expectations and you have deployed the code to production where, AND ONLY where, you can be confident the code works.