Spark Connect Dotnet November 2024 - Where are we?

There have been quite a few changes in the last couple of months, and I just wanted to give a quick update on the current state of the project. In terms of usage, I am starting to hear from people using the library and submitting PRs and feature requests, so although usage is pretty low (which is expected, given that usage of the Microsoft-supported version wasn't very high either), it is growing, which is interesting.

Spark Connect Dotnet Variant Data Type

I recently published the latest version of the Spark Connect Dotnet library, which includes support for the new Variant data type in Apache Spark 4.0. Variant is one of the new features of Spark 4.0 and is a faster way of processing JSON data. For this post I used a copy of the sample data from Adobe (https://opensource.adobe.com/Spry/samples/data_region/JSONDataSetSample.html).
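To give a feel for how that looks from .NET, here is a minimal sketch using the library's SQL passthrough. parse_json and variant_get are the Spark 4.0 SQL functions for Variant; the connection string and port are assumptions for a local Spark Connect server.

```csharp
using Spark.Connect.Dotnet.Sql;

// A minimal sketch, assuming a Spark 4.0 Connect server on the default local port.
var spark = SparkSession.Builder.Remote("http://localhost:15002").GetOrCreate();

// parse_json turns a JSON string into a Variant; variant_get then extracts a
// typed value from the Variant using a JSON path expression.
spark.Sql(
    "SELECT variant_get(parse_json('{\"id\": 1, \"name\": \"donut\"}'), '$.name', 'string') AS name"
).Show();
```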

Delta Lake from .NET in a Spark Connect gRPC world

At some point we will want to do something with Delta Lake, so I wanted to explore the options. Before we do that, there is a little explaining to do about Delta Lake and Spark. There are two completely separate sides to this: the first is getting Spark to read and write in Delta format, and the second is performing operations on the actual files directly without using Spark - operations like Vacuum etc.
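As a rough sketch of the "let Spark do the work" side, assuming the Spark Connect server was started with the Delta Lake package and extensions configured: Delta is just another reader/writer format, and maintenance commands like VACUUM can go through SQL. The method names here mirror the PySpark API the library follows, and the paths are placeholders.

```csharp
using Spark.Connect.Dotnet.Sql;

var spark = SparkSession.Builder.Remote("http://localhost:15002").GetOrCreate();

// Write a small DataFrame out in Delta format (path is just an example).
spark.Range(0, 100)
     .Write()
     .Format("delta")
     .Mode("overwrite")
     .Save("/tmp/delta-demo");

// Read it back the same way, just with Load instead of Save.
spark.Read().Format("delta").Load("/tmp/delta-demo").Show();

// Maintenance operations such as VACUUM can be issued as plain SQL.
spark.Sql("VACUUM delta.`/tmp/delta-demo` RETAIN 168 HOURS").Show();
```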

Apache Spark from PHP - it is not just a .NET thing

I wanted to explore what the Spark Connect API looks like from other languages. I am not a PHP developer - I used it a long time ago and have read up on some of the modern changes, so apologies if I insult any PHP-ers! I will say that I quite like PHP. The setup instructions are from https://grpc.io/docs/languages/php/quickstart/ with an extra step.

Implementing functions and more fun in Spark Connect using gRPC and .NET

The goal of this post is to look at creating a SparkSession and a DataFrame that wraps the Range relation; we will then use the WithColumn function to add a column to the DataFrame and the Show function to display it. We won't have a builder yet, but we are moving towards: var spark = SparkSession .
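Filling that shape out, a sketch of where the API ends up might look like this; the builder, Expr, and WithColumn names are assumed from the PySpark API the library mirrors.

```csharp
using Spark.Connect.Dotnet.Sql;
using static Spark.Connect.Dotnet.Sql.Functions;

var spark = SparkSession.Builder.Remote("http://localhost:15002").GetOrCreate();

// Wrap the Range relation in a DataFrame, add a derived column with
// WithColumn, then print the result with Show.
spark.Range(0, 10)
     .WithColumn("doubled", Expr("id * 2"))
     .Show();
```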

Moving towards the DataFrame API using the Spark Connect gRPC API in .NET

There are two goals for this post: the first is to take a look at Apache Arrow and how we can do things like show the output from DataFrame.Show; the second is to start to create objects that look more familiar to us, i.e. the DataFrame API. I want to take it in small steps, but I 100% know that this sort of syntax is possible:
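On the Arrow side, the bytes that come back from ExecutePlan are Arrow IPC stream batches, and the Apache.Arrow NuGet package can decode them. A sketch, where arrowBatchBytes stands in for the Data field of an ArrowBatch response:

```csharp
using System;
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;

// Usage would be something like (hypothetical response variable):
// PrintBatches(response.ArrowBatch.Data.ToByteArray());

static void PrintBatches(byte[] arrowBatchBytes)
{
    // The payload is an Arrow IPC stream: a schema followed by record batches.
    using var reader = new ArrowStreamReader(new MemoryStream(arrowBatchBytes));

    RecordBatch batch;
    while ((batch = reader.ReadNextRecordBatch()) != null)
    {
        Console.WriteLine($"batch: {batch.ColumnCount} columns, {batch.Length} rows");

        // Range output is a single Int64 column, so print it when we see one.
        if (batch.Column(0) is Int64Array ids)
        {
            for (var i = 0; i < ids.Length; i++)
            {
                Console.WriteLine(ids.GetValue(i));
            }
        }
    }
}
```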

Exploring the Spark Connect gRPC API more

In this post we will continue looking at the gRPC API, specifically the AnalyzePlan method, which takes a plan and analyzes it. To be honest, I expected this post to be longer but decided just to cover the AnalyzePlan method. There are a few more APIs, like ReleaseExecute, InterruptAsync, and ReattachExecute, that I was going to cover but changed my mind, so consider this part of the last post :).
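As a sketch of what an AnalyzePlan call looks like from the generated client: the Schema variant asks the server to resolve a plan's schema without executing it. The message shapes follow the Spark Connect protos; the endpoint and session handling are simplified assumptions.

```csharp
using System;
using Grpc.Net.Client;
using Spark.Connect;

var channel = GrpcChannel.ForAddress("http://localhost:15002");
var client = new SparkConnectService.SparkConnectServiceClient(channel);

// Ask the server for the schema of a plan wrapping a simple Range relation
// (Range is qualified to avoid clashing with System.Range).
var request = new AnalyzePlanRequest
{
    SessionId = Guid.NewGuid().ToString(),
    UserContext = new UserContext { UserId = "demo" },
    Schema = new AnalyzePlanRequest.Types.Schema
    {
        Plan = new Plan
        {
            Root = new Relation { Range = new Spark.Connect.Range { End = 10, Step = 1 } }
        }
    }
};

var response = client.AnalyzePlan(request);
Console.WriteLine(response.Schema.Schema); // the resolved DataType for the plan
```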

Exploring the Spark Connect gRPC API

In the first two posts, we looked at how to run some Spark code, firstly against a local Spark Connect server and then against a Databricks cluster. In this post, we will look more at the actual gRPC API itself, namely ExecutePlan, Config, and AddArtifacts/ArtifactsStatus. The way we call the API is using the SparkConnectService.SparkConnectServiceClient: we take the .proto files from the Spark repo and add them to a Visual Studio project with the Protobuf build action that comes from the Grpc.Tools package.
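For a flavour of ExecutePlan itself, a sketch: the call returns a server stream, and each response message can carry an Arrow batch of results. Again, the message shapes come from the protos; the endpoint and session handling are simplified assumptions.

```csharp
using System;
using System.Threading;
using Grpc.Net.Client;
using Spark.Connect;

var channel = GrpcChannel.ForAddress("http://localhost:15002");
var client = new SparkConnectService.SparkConnectServiceClient(channel);

var request = new ExecutePlanRequest
{
    SessionId = Guid.NewGuid().ToString(),
    UserContext = new UserContext { UserId = "demo" },
    Plan = new Plan
    {
        // Qualified to avoid clashing with System.Range.
        Root = new Relation { Range = new Spark.Connect.Range { End = 5, Step = 1 } }
    }
};

// ExecutePlan streams responses back until the operation completes.
using var call = client.ExecutePlan(request);
while (await call.ResponseStream.MoveNext(CancellationToken.None))
{
    var response = call.ResponseStream.Current;
    if (response.ArrowBatch != null)
    {
        Console.WriteLine($"received an Arrow batch with {response.ArrowBatch.RowCount} rows");
    }
}
```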

Using Spark Connect from .NET to run Spark jobs on Databricks

This post aims to show how we can create a .NET application, deploy it to Databricks, and then run a Databricks job that calls our .NET code, which uses Spark Connect to run a Spark job on the Databricks job cluster to write some data out to Azure storage. In the previous post, I showed how to use the Range command to create a Spark DataFrame and then save it locally as a Parquet file.
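For the connection itself, the only real difference from the local examples is the remote string; a sketch with a placeholder workspace URL, token, cluster id, and storage path (the writer method names mirror the PySpark API the library follows):

```csharp
using Spark.Connect.Dotnet.Sql;

// Placeholder values throughout - substitute your own workspace, token,
// cluster id, and storage account.
var remote = "sc://adb-1234567890123456.7.azuredatabricks.net:443/" +
             ";token=dapiXXXXXXXX;x-databricks-cluster-id=0123-456789-abcdefgh";

var spark = SparkSession.Builder.Remote(remote).GetOrCreate();

// Same Range-then-write pattern as running locally, but the Parquet output
// lands in Azure storage via an abfss path.
spark.Range(0, 1000)
     .Write()
     .Mode("overwrite")
     .Format("parquet")
     .Save("abfss://container@account.dfs.core.windows.net/demo/range");
```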

Spark Connect Dotnet First Major Milestone

When I wrote the spark-connect-dotnet lib, I didn't envisage that I would implement every function; instead, the plan was a combination of implementing the most common functionality and showing people how they can make their own gRPC calls to call the rest of the functions. What I found is that, once I had figured out the shared functionality, actually implementing the functions was pretty easy, and by implementing all of them I was able to get the supporting pieces, like collecting data back through Arrow, working.