Ed Elliott's Agile SQL Club

Using Spark Connect from .NET to run Spark jobs on Databricks

Goal of this post: This post aims to show how we can create a .NET application, deploy it to Databricks, and then run a Databricks job that calls our .NET code, which uses Spark Connect to run a Spark job on the Databricks job cluster to write some data out to Azure storage. In the previous post, I showed how to use the Range command to create a Spark DataFrame and then save it locally as a parquet file.

Spark Connect Dotnet First Major Milestone

When I wrote the spark-connect-dotnet lib, I didn’t envisage that I would implement every function. Instead, it would be a combination of implementing the most common functionality and showing people how they can make their own gRPC calls to call the rest of the functions. What I found is that implementing the functions, once I had figured out the shared functionality, was pretty easy, and by implementing all of the functions I was able to get the supporting functions, like collecting data back through Arrow, working.

Using Spark Connect from .NET

Introductory Ramble. Spark Connect: In July 2022, at the Data and AI summit, Apache Spark announced “Spark Connect,” a way of connecting to Apache Spark using the gRPC protocol rather than the supplied Java or Python APIs. I’ve been using Spark since 2017 and have been completely obsessed with how transformative it has been in the data processing space. Me: Over the past couple of decades working in IT, I have found a particular interest in protocols.

ADF Tweaks - making the UI a little less frustrating

I have been using ADF for a few years now, and there are some parts of the development experience that I find frustrating, so I decided to do something about it. I have created a Chrome extension that does two things: first, it lets you control the width of the dynamic properties side panel, and second, it lets you export the data that is shown in the preview window. If there are more things that I can add easily, then I will.

ADF: Error trying to debug pipeline: BadRequest

I made a mistake recently when I was creating an ADF pipeline. Annoyingly, I made loads of changes and then clicked the debug button. When I pressed debug, the pipeline failed to start and I was presented with a little beaut of an error message. The pipeline was quite complicated, so I didn’t know exactly what was causing it. I went through the usual ADF troubleshooting steps (save all, then refresh the web page), but that didn’t help.

ADF: Querying JSON documents

In my previous blog post I talked about how to read from an XML web service and use XPath to query the XML on the expressions side of things. You can read the XML article here (https://the.agilesql.club/2021/02/adf-xml-objects-and-xpath-in-the-expression-language/). Now, what if we don’t have XML but have JSON? Well, well indeed: what if there was a way to query JSON documents using a query? Imagine, if you will, a JSONQuery where you can pass a query, similar to an XPath query, to retrieve specific values from the JSON document.

ADF, XML objects and XPath in the expression language

When you use ADF, there are two sides to the coin. The first is the data itself, which ADF handles very well: everything from moving it from one site to another, to flattening JSON documents, to converting from CSV to Avro, to Parquet, to SQL, is powerful. The other side of the coin is how ADF uses data as variables to manage the pipeline, and it is this side of the coin that I wish to talk about today.

Synapse Analytics and .NET for Apache Spark Example 4 - JOINS

This is a bit of a longer one: a look at how to do all the different joins. The exciting thing for MSSQL developers is that we get a couple of extra joins (semi and anti semi, oooooooh).
T-SQL: SELECT * FROM chicago.safety_data one INNER JOIN chicago.safety_data two ON one.Address = two.Address;
Spark SQL: SELECT * FROM chicago.safety_data one INNER JOIN chicago.safety_data two ON one.Address = two.Address
DataFrame API (C#): var dataFrame = spark.

Synapse Analytics and .NET for Apache Spark Example 3 - CTE()

The next example is how to do a CTE (Common Table Expression). When creating the CTE I will also rename one of the columns from “dateTime” to “x”.
T-SQL: WITH CTE(x, dataType, dataSubType) AS ( SELECT dateTime, dataType, dataSubType FROM chicago.safety_data ) SELECT * FROM CTE;
Spark SQL: WITH CTE AS (SELECT dateTime as x, dataType, dataSubType FROM chicago.safety_data) SELECT * FROM CTE
DataFrame API (C#): The DataFrame example is a bit odd - by creating a data frame with the first query we have the CTE that we can use:

Synapse Analytics and .NET for Apache Spark Example 2 - ROW_NUMBER()

The next example is how to do a ROW_NUMBER(), my favourite window function.
T-SQL: SELECT *, ROW_NUMBER() OVER(ORDER BY dateTime) as RowNumber FROM chicago.safety_data
Spark SQL: SELECT *, ROW_NUMBER() OVER(ORDER BY dateTime) as RowNumber FROM chicago.safety_data
DataFrame API (C#):
var dataFrame = spark.Read().Table("chicago.safety_data");
var window = Microsoft.Spark.Sql.Expressions.Window.OrderBy("dateTime");
dataFrame = dataFrame.WithColumn("RowNumber", Functions.RowNumber().Over(window));
dataFrame.Show();
To see this in action, please feel free to deploy this repo to your Synapse Analytics repo: https://github.com/GoEddie/SynapseSparkExamples