Ed Elliott's Agile SQL Club

Synapse Analytics and .NET for Apache Spark Example 1 - Group By

I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples of how to do something in T-SQL, then translated into how to do that same thing in Spark SQL and the Spark DataFrame API in C#.

How to get data in a DataFrame via .NET for Apache Spark

When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. To get the data in a state that Apache Spark can process it involves putting the data into a DataFrame. How do you take some data and get it into a DataFrame? This post will cover all the ways to get data into a DataFrame in .

Git in 10 commands to do 99% of gitting and keeping out of trouble

Git is hard, probably harder than it needs to be but I have been using it for about 5 years and have a workflow that works for me. I use git from a command line and have learnt how to use these ten commands. If I need to deal with a merge conflict, I use VS Code and the gitlens extension. If I need to do anything else then I probably copy out the files I want to keep, reset my local repo or delete it and clone a new repo then paste back the files I want to include in the change.

TF-IDF in .NET for Apache Spark Using Spark ML v2

Spark ML in .NET for Apache Spark Apache Spark has had a machine learning API for quite some time and this has been partially implemented in .NET for Apache Spark. In this post we will look at how we can use the Apache Spark ML API from .NET. This is the second version of this post, the first version was written before version 1 of .NET for Apache Spark and there was a vital piece of the implementation missing which meant although we could build the model in .

Blog .NET for Apache Spark UDFs Missing Shared State

The Problem When you use a UDF in .NET for Apache Spark, something like this code: class Program { static void Main(string[] args) { var spark = SparkSession.Builder().GetOrCreate(); _logging.AppendLine("Starting Select"); var udf = Functions.Udf<int, string>(theUdf); spark.Range(100).Select(udf(Functions.Col("id"))).Show(); _logging.AppendLine("Ending Select"); Console.WriteLine(_logging.ToString()); } private static readonly StringBuilder _logging = new StringBuilder(); private static string theUdf(int val) { _logging.AppendLine($"udf passed: {val}"); return $"udf passed {val}"; } } Generally, knowing .NET we would expect the following output:

TF-IDF in .NET for Apache Spark Using Spark ML

Last Updated: 2020-10-18 NOTE: What you read here was before .NET for Apache Spark 1.0 which includes everything we need to do this purely in .NET - in this post you will see an example that is no longer necessary for TF-IDF, instead view: https://the.agilesql.club/2020/12/spark-dotnet-tf-idf. Spark ML in .NET for Apache Spark Spark is awesome, .NET is awesome, machine learning (ML) is awesome, so what could be better than using .

Approaches to running Databricks ETL code from Azure ADF

Databricks is fantastic, but there is a small issue with how people use it. The problem is that Databricks is all things to all people. Data scientists and data analysts use Databricks to explore their data and write cool things. ML engineers use it to get their models to execute somewhere. Meanwhile, the cool kids (data engineers obviously) use it to run their ETL code. Some use cases favour instant access to “all the datas”, some favour rigorous engineering discipline so when we look at Databricks it is a case of one size does not fit all.

The four tenets of ETL testing

Every ETL pipeline is only ever as reliable as the data that the upstream system provides. It is inevitable that assumptions you make about the data you are provided will be shattered and there is absolutely nothing you can do about it. So what can we do? Do we just accept that our pipelines will break and fix them when the CEO shouts that the figures are out or even worse if no one notices and the data is wrong for months or years?

Passing status messages and results back from Databricks to ADF

When we use ADF to call Databricks we can pass parameters, nice. When we finish running the Databricks notebook we often want to return something back to ADF so ADF can do something with it. Think that Databricks might create a file with 100 rows in (actually big data 1,000 rows) and we then might want to move that file or write a log entry to say that 1,000 rows have been written.

Spark Delta Lake, Updates, Deletes and Time Travel

When you use delta lake there are a couple of interesting things to note based around the fact that the data is stored in parquet files which are read-only and delta lake includes the ability to delete and update data and view the state of a table at a specific point in time. Obviously read-only and updates and deletes don’t exactly sound like they work together, so how does it all work and what do we need to be aware of?