In July 2022, at the Data and AI Summit, Apache Spark announced "Spark Connect", a way of connecting to Apache Spark using the gRPC protocol rather than the supplied Java or Python APIs. I've been using Spark since 2017 and have been completely obsessed with how transformative it has been in the data processing space. Over the past couple of decades working in IT, I have found a particular interest in protocols.
I made a mistake recently when I was creating an ADF pipeline. Annoyingly, I made loads of changes and then clicked the debug button; when I pressed debug, the pipeline failed to start and I was presented with this little beaut of an error message: The pipeline was quite complicated, so I didn't know exactly what was causing the failure. I went through the usual ADF troubleshooting steps (save everything, then refresh the web page), but that didn't help.
In my previous blog post I talked about how to read from an XML web service and use xpath to query the XML on the expressions side of things. You can read the XML article here (https://the.agilesql.club/2021/02/adf-xml-objects-and-xpath-in-the-expression-language/). Now, what if we don't have XML but have JSON? Well, what if there were a way to query JSON documents with a query language: imagine, if you will, a "JSONQuery" where you could pass something similar to an xpath query to retrieve specific values from the JSON document.
When you use ADF, there are two sides to the coin. The first is the data itself, which ADF handles very well: moving it from one place to another, flattening JSON documents, and converting from CSV to Avro, to Parquet, to SQL. The other side of the coin is how ADF uses data as variables to manage the pipeline, and it is this side of the coin that I wish to talk about today.
This is a bit of a longer one: a look at how to do all the different joins, and the exciting thing for MSSQL developers is that we get a couple of extra joins (semi and anti semi, oooooooh).

T-SQL: SELECT * FROM chicago.safety_data one INNER JOIN chicago.safety_data two ON one.Address = two.Address;

Spark SQL: SELECT * FROM chicago.safety_data one INNER JOIN chicago.safety_data two ON one.Address = two.Address

DataFrame API (C#): var dataFrame = spark.
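The excerpt cuts off the C# code above, so here is a minimal sketch of what the DataFrame-API version of that self-join on Address could look like in .NET for Apache Spark (my own illustration, not necessarily the code from the post):

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

var spark = SparkSession.Builder().GetOrCreate();

// Read the same table twice and alias each side so the join columns can be qualified.
DataFrame one = spark.Read().Table("chicago.safety_data").Alias("one");
DataFrame two = spark.Read().Table("chicago.safety_data").Alias("two");

// INNER JOIN on Address, equivalent to the T-SQL and Spark SQL statements above.
DataFrame joined = one.Join(two, Col("one.Address").EqualTo(Col("two.Address")), "inner");
joined.Show();
```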
The next example is how to do a CTE (Common Table Expression). When creating the CTE I will also rename one of the columns, from "dateTime" to "x".

T-SQL: WITH CTE(x, dataType, dataSubType) AS (SELECT dateTime, dataType, dataSubType FROM chicago.safety_data) SELECT * FROM CTE;

Spark SQL: WITH CTE AS (SELECT dateTime as x, dataType, dataSubType FROM chicago.safety_data) SELECT * FROM CTE

DataFrame API (C#): The DataFrame example is a bit odd - by creating a data frame with the first query we have the CTE that we can use:
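The C# is cut off in the excerpt; a minimal sketch of that idea (my own illustration, assuming the Microsoft.Spark API rather than quoting the post) is to build the "CTE" as its own DataFrame first and then query it:

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

var spark = SparkSession.Builder().GetOrCreate();

// The "CTE": a DataFrame that selects the columns we need and renames dateTime to x.
DataFrame cte = spark.Read()
    .Table("chicago.safety_data")
    .Select(Col("dateTime").Alias("x"), Col("dataType"), Col("dataSubType"));

// SELECT * FROM CTE
cte.Show();
```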
The next example is how to do a ROW_NUMBER(), my favourite window function.

T-SQL: SELECT *, ROW_NUMBER() OVER(ORDER BY dateTime) as RowNumber FROM chicago.safety_data

Spark SQL: SELECT *, ROW_NUMBER() OVER(ORDER BY dateTime) as RowNumber FROM chicago.safety_data

DataFrame API (C#):
var dataFrame = spark.Read().Table("chicago.safety_data");
var window = Microsoft.Spark.Sql.Expressions.Window.OrderBy("dateTime");
dataFrame = dataFrame.WithColumn("RowNumber", Functions.RowNumber().Over(window));
dataFrame.Show();

To see this in action, please feel free to deploy this repo to your Synapse Analytics workspace: https://github.com/GoEddie/SynapseSparkExamples
I have been playing around with the new Azure Synapse Analytics, and I realised that this is an excellent opportunity for people to move to Apache Spark. Synapse Analytics ships with .NET for Apache Spark C# support, so many people will surely try to convert T-SQL code or SSIS code into Apache Spark code. I thought it would be awesome if there were a set of examples of how to do something in T-SQL, translated into how to do that same thing in Spark SQL and in the Spark DataFrame API in C#.
When I first started working with Apache Spark, one of the things I struggled with was that I would have some variable or data in my code that I wanted to work on with Apache Spark. To get the data into a state that Apache Spark can process, you have to put it into a DataFrame. How do you take some data and get it into a DataFrame? This post will cover all the ways to get data into a DataFrame in .NET for Apache Spark.
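As a taste of what that looks like, here is a minimal sketch (my own example, with a made-up schema and rows) of turning in-memory data into a DataFrame with SparkSession.CreateDataFrame in .NET for Apache Spark:

```csharp
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

var spark = SparkSession.Builder().GetOrCreate();

// A hypothetical schema and a couple of in-memory rows, purely for illustration.
var schema = new StructType(new[]
{
    new StructField("Name", new StringType()),
    new StructField("Age", new IntegerType())
});

var rows = new List<GenericRow>
{
    new GenericRow(new object[] { "Ada", 36 }),
    new GenericRow(new object[] { "Grace", 45 })
};

// Turn the local rows into a DataFrame that Spark can process.
DataFrame dataFrame = spark.CreateDataFrame(rows, schema);
dataFrame.Show();
```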
Git is hard, probably harder than it needs to be, but I have been using it for about 5 years and have a workflow that works for me. I use git from the command line and have learnt how to use these ten commands. If I need to deal with a merge conflict, I use VS Code and the GitLens extension. If I need to do anything else, then I probably copy out the files I want to keep, reset my local repo (or delete it and clone a new one), and then paste back the files I want to include in the change.