Synapse Analytics and .NET for Apache Spark Example 4 - JOINS

This is a bit of a longer one: a look at how to do all the different joins, and the exciting thing for MSSQL developers is that we get a couple of extra joins (semi and anti semi, oooooooh).

T-SQL

SELECT * FROM chicago.safety_data one INNER JOIN chicago.safety_data two ON one.Address = two.Address;

Spark SQL

SELECT * FROM chicago.safety_data one INNER JOIN chicago.safety_data two ON one.Address = two.Address

DataFrame API (C#)

var dataFrame = spark.
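As a minimal sketch of the DataFrame API version (not necessarily the post's exact code), assuming the chicago.safety_data table is available in the attached Spark pool:

using Microsoft.Spark.Sql;

var spark = SparkSession.Builder().GetOrCreate();

// Read the same table twice so we have two sides to join
var one = spark.Read().Table("chicago.safety_data");
var two = spark.Read().Table("chicago.safety_data");

// Inner join on the shared Address column; swapping "inner" for "left_semi"
// or "left_anti" gives the semi and anti semi joins mentioned above
var joined = one.Join(two, new[] { "Address" }, "inner");
joined.Show();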

Synapse Analytics and .NET for Apache Spark Example 3 - CTE()

The next example is how to do a CTE (Common Table Expression). When creating the CTE I will also rename one of the columns, from “dateTime” to “x”.

T-SQL

WITH CTE(x, dataType, dataSubType) AS
(
    SELECT dateTime, dataType, dataSubType FROM chicago.safety_data
)
SELECT * FROM CTE;

Spark SQL

WITH CTE AS (SELECT dateTime as x, dataType, dataSubType FROM chicago.safety_data) SELECT * FROM CTE

DataFrame API (C#)

The DataFrame example is a bit odd - by creating a data frame with the first query we have the CTE that we can use:
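As a minimal sketch of that idea (not necessarily the post's exact code), the first query becomes a DataFrame that plays the part of the CTE, assuming the chicago.safety_data table is available:

using Microsoft.Spark.Sql;

var spark = SparkSession.Builder().GetOrCreate();

// The "CTE": the first query as a DataFrame, renaming dateTime to x
var cte = spark.Read().Table("chicago.safety_data")
    .Select(Functions.Col("dateTime").Alias("x"),
            Functions.Col("dataType"),
            Functions.Col("dataSubType"));

// SELECT * FROM CTE
cte.Show();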

Synapse Analytics and .NET for Apache Spark Example 2 - ROW_NUMBER()

The next example is how to do a ROW_NUMBER(), my favourite window function.

T-SQL

SELECT *, ROW_NUMBER() OVER(ORDER BY dateTime) as RowNumber FROM chicago.safety_data

Spark SQL

SELECT *, ROW_NUMBER() OVER(ORDER BY dateTime) as RowNumber FROM chicago.safety_data

DataFrame API (C#)

var dataFrame = spark.Read().Table("chicago.safety_data");
var window = Microsoft.Spark.Sql.Expressions.Window.OrderBy("dateTime");
dataFrame = dataFrame.WithColumn("RowNumber", Functions.RowNumber().Over(window));
dataFrame.Show();

To see this in action, please feel free to deploy this repo to your Synapse Analytics workspace: https://github.com/GoEddie/SynapseSparkExamples

Synapse Analytics and .NET for Apache Spark Example 1 - Group By

I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support, so many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples showing how to do something in T-SQL, then how to do that same thing in Spark SQL and the Spark DataFrame API in C#.
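As a minimal sketch of the kind of Group By translation the series opens with, assuming the chicago.safety_data table used by the later examples (the actual columns and aggregate in the post may differ):

using Microsoft.Spark.Sql;

var spark = SparkSession.Builder().GetOrCreate();

var dataFrame = spark.Read().Table("chicago.safety_data");

// Roughly: SELECT dataType, COUNT(*) AS total FROM chicago.safety_data GROUP BY dataType
dataFrame.GroupBy("dataType")
         .Agg(Functions.Count(Functions.Col("*")).Alias("total"))
         .Show();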

How to get data in a DataFrame via .NET for Apache Spark

When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. Getting the data into a state that Apache Spark can process means putting it into a DataFrame. So how do you take some data and get it into a DataFrame? This post will cover all the ways to get data into a DataFrame in .NET for Apache Spark.
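As a minimal sketch of one of those ways, assuming the Microsoft.Spark CreateDataFrame overload that takes GenericRow values plus a schema (the column names and sample values here are made up for illustration):

using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

var spark = SparkSession.Builder().GetOrCreate();

// Describe the shape of the data we are about to hand to Spark
var schema = new StructType(new[]
{
    new StructField("name", new StringType()),
    new StructField("score", new IntegerType())
});

// In-memory rows from our .NET code
var rows = new[]
{
    new GenericRow(new object[] { "a", 1 }),
    new GenericRow(new object[] { "b", 2 })
};

var dataFrame = spark.CreateDataFrame(rows, schema);
dataFrame.Show();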

Git in 10 commands to do 99% of gitting and keeping out of trouble

Git is hard, probably harder than it needs to be, but I have been using it for about 5 years and have a workflow that works for me. I use git from a command line and have learnt how to use these ten commands. If I need to deal with a merge conflict, I use VS Code and the gitlens extension. If I need to do anything else, I probably copy out the files I want to keep, reset my local repo (or delete it and clone a new repo), then paste back the files I want to include in the change.

TF-IDF in .NET for Apache Spark Using Spark ML v2

Spark ML in .NET for Apache Spark

Apache Spark has had a machine learning API for quite some time, and this has been partially implemented in .NET for Apache Spark. In this post we will look at how we can use the Apache Spark ML API from .NET. This is the second version of this post; the first version was written before version 1 of .NET for Apache Spark, and there was a vital piece of the implementation missing, which meant that although we could build the model in .
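As a minimal sketch of the kind of Tokenizer, HashingTF and IDF pipeline the post walks through, using the Microsoft.Spark.ML.Feature classes; the input path and column names here are assumptions for illustration:

using Microsoft.Spark.ML.Feature;
using Microsoft.Spark.Sql;

var spark = SparkSession.Builder().GetOrCreate();

// Illustrative input: a CSV with a "text" column containing the documents
var documents = spark.Read().Option("header", true).Csv("/tmp/documents.csv");

var tokenizer = new Tokenizer().SetInputCol("text").SetOutputCol("words");
var hashingTf = new HashingTF().SetInputCol("words").SetOutputCol("rawFeatures").SetNumFeatures(10000);
var idf = new IDF().SetInputCol("rawFeatures").SetOutputCol("features");

// Term frequencies, then fit the IDF model and produce the TF-IDF vectors
var withWords = tokenizer.Transform(documents);
var withTermFrequencies = hashingTf.Transform(withWords);
var tfidf = idf.Fit(withTermFrequencies).Transform(withTermFrequencies);

tfidf.Show();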

.NET for Apache Spark UDFs Missing Shared State

The Problem

When you use a UDF in .NET for Apache Spark, something like this code:

class Program
{
    static void Main(string[] args)
    {
        var spark = SparkSession.Builder().GetOrCreate();

        _logging.AppendLine("Starting Select");

        var udf = Functions.Udf<int, string>(theUdf);
        spark.Range(100).Select(udf(Functions.Col("id"))).Show();

        _logging.AppendLine("Ending Select");

        // Print everything that was logged, including the lines written by the UDF
        Console.WriteLine(_logging.ToString());
    }

    private static readonly StringBuilder _logging = new StringBuilder();

    private static string theUdf(int val)
    {
        // Log from inside the UDF into the same static StringBuilder
        _logging.AppendLine($"udf passed: {val}");
        return $"udf passed {val}";
    }
}

Generally, knowing .NET we would expect the following output:

TF-IDF in .NET for Apache Spark Using Spark ML

Last Updated: 2020-10-18

NOTE: What you read here was written before .NET for Apache Spark 1.0, which includes everything we need to do this purely in .NET - in this post you will see an approach that is no longer necessary for TF-IDF; instead, view: https://the.agilesql.club/2020/12/spark-dotnet-tf-idf.

Spark ML in .NET for Apache Spark

Spark is awesome, .NET is awesome, machine learning (ML) is awesome, so what could be better than using .

Approaches to running Databricks ETL code from Azure ADF

Databricks is fantastic, but there is a small issue with how people use it. The problem is that Databricks is all things to all people. Data scientists and data analysts use Databricks to explore their data and write cool things. ML engineers use it to get their models to execute somewhere. Meanwhile, the cool kids (data engineers, obviously) use it to run their ETL code. Some use cases favour instant access to “all the datas”, some favour rigorous engineering discipline, so when we look at Databricks it is a case of one size does not fit all.