Synapse Analytics and .NET for Apache Spark Example 1 - Group By
I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support, so many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples showing how to do something in T-SQL, then how to do the same thing in Spark SQL and with the Spark DataFrame API in C#.
I have created a Synapse Analytics repo, https://github.com/GoEddie/SynapseSparkExamples, which includes a set of notebooks that work with the sample data shipped with Synapse Analytics. Each notebook has three sections: T-SQL, Spark SQL, and DataFrame API (C#).
You can deploy the repo to a test Synapse Analytics workspace to see the examples in action, but the code itself is stored as a JSON document, so it is quite hard to read. So I thought I would create a blog post for each example so you can have your cake and eat it: view the code here, or deploy it and run it yourself.
Example 1 - Group By
T-SQL
SELECT DataType, COUNT(*)
FROM chicago.safety_data
GROUP BY DataType
Spark SQL
SELECT DataType, COUNT(*)
FROM chicago.safety_data
GROUP BY DataType
DataFrame API (C#)
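using Microsoft.Spark.Sql;

// Read the sample table, group by DataType and count the rows in each group
var dataFrame = spark.Read().Table("chicago.safety_data");
var grouped = dataFrame
    .GroupBy("DataType")
    .Agg(Functions.Count("DataType"));
grouped.Show();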
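One thing to watch: COUNT(*) counts every row in each group, whereas Functions.Count("DataType") only counts rows where DataType is not null, so a group of null DataType values would show a count of 0. If you want something closer to the T-SQL, a sketch like the one below should do it. It assumes the spark session that the Synapse notebook provides, and the "RowCount" alias is just a name I picked for illustration.

```csharp
using Microsoft.Spark.Sql;

// Assumes `spark` is the session provided by the Synapse notebook.
var dataFrame = spark.Read().Table("chicago.safety_data");

// GroupBy(...).Count() counts every row in each group, like COUNT(*).
var rowCounts = dataFrame
    .GroupBy("DataType")
    .Count();
rowCounts.Show();

// The same thing via Agg: counting a literal counts all rows, nulls included.
// "RowCount" is a hypothetical column name chosen for this example.
var namedCounts = dataFrame
    .GroupBy("DataType")
    .Agg(Functions.Count(Functions.Lit(1)).Alias("RowCount"));
namedCounts.Show();
```

If DataType never contains nulls in the sample data, all three versions should give the same counts; the difference only shows up when the grouping column can be null.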
There you have it, the first example.
To see this in action, please feel free to deploy this repo to your Synapse Analytics workspace: https://github.com/GoEddie/SynapseSparkExamples