Synapse Analytics and .NET for Apache Spark Example 1 - Group By

I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Because Synapse Analytics ships with .NET for Apache Spark (C#) support, many people will surely try to convert T-SQL or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples showing how to do something in T-SQL, then how to do the same thing in Spark SQL and in the Spark DataFrame API in C#.

I have created a Synapse Analytics repo, https://github.com/GoEddie/SynapseSparkExamples, which includes a set of notebooks that work with the sample data shipped with Synapse Analytics. Each notebook has three sections: T-SQL, Spark SQL, and DataFrame API (C#).

You can deploy the repo to a test Synapse Analytics workspace to see the examples in action, but the notebooks are stored as JSON documents, which makes the code quite hard to read. So I thought I would write a blog post for each example, so you can have your cake and eat it: view the code here, or deploy it and run it yourself.

Example 1 - Group By

T-SQL

SELECT DataType, COUNT(*) 
    FROM chicago.safety_data
    GROUP BY DataType

Spark SQL

SELECT DataType, COUNT(*) 
    FROM chicago.safety_data
    GROUP BY DataType

DataFrame API (C#)

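using Microsoft.Spark.Sql;

// Read the sample table; `spark` is the session the notebook provides
var dataFrame = spark.Read().Table("chicago.safety_data");

// Group by DataType and count the DataType values in each group
var grouped = dataFrame
    .GroupBy("DataType")
    .Agg(Functions.Count("DataType"));

grouped.Show();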
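
One small difference from the SQL versions is worth noting: Functions.Count("DataType") counts only the non-null DataType values, whereas COUNT(*) counts every row in the group, so the two can disagree for the group where DataType is null. If you want the exact COUNT(*) behaviour, a minimal sketch is to call Count() on the grouped data instead. This assumes the same notebook-provided `spark` session and table as above; the variable name rowCounts is just for illustration.

```csharp
using Microsoft.Spark.Sql;

// Assumes the notebook's predefined `spark` session, as in the example above
DataFrame dataFrame = spark.Read().Table("chicago.safety_data");

// Count() on the grouped data counts all rows per group, like COUNT(*),
// and puts the result in a column named "count"
DataFrame rowCounts = dataFrame
    .GroupBy("DataType")
    .Count();

rowCounts.Show();
```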

There you have it, the first example.

To see this in action, please feel free to deploy this repo to your Synapse Analytics workspace: https://github.com/GoEddie/SynapseSparkExamples