Ed Elliott's Agile SQL Club

Delta Lake over Spark Connect

I have just finished an update to the Spark Connect Dotnet library that contains the DeltaTable implementation, so we can now use .NET to maintain Delta tables, over and above what we get out of the box with DataFrame.Write.Format("delta"). This is an example of how to use the Delta API from .NET: var deltaTable = DeltaTable.ForPath(spark, deltaPath); deltaTable.History().Show(10, 1000); deltaTable.Update("id < 10", (Col("id"), Lit(0))); deltaTable.

ADF: Publish suddenly includes everything where it used to be incremental changes since the last publish

I recently encountered an interesting issue with ADF where the publish feature suddenly attempted to republish every single object, claiming they were new, even though it had been incrementally publishing only the changed objects for some time. We were using the publish workflow where you work on a branch until you are happy, raise a PR to main, merge to main, and then switch back to ADF and click publish to push the changes to the adf_publish branch.

Spark Connect Dotnet November 2024 Where are we?

Introduction: There have been quite a few changes in the last couple of months and I just wanted to give a quick update on the current state of the project. In terms of usage, I am starting to hear from people who are using the library and submitting PRs and requests, so although usage is pretty low (which is expected, given that usage of the Microsoft-supported version wasn’t very high), it is growing, which is interesting.

Spark Connect Dotnet Variant Data Type

I recently published the latest version of the Spark Connect Dotnet library, which includes support for the new Variant data type in Apache Spark 4.0 here. One of the new features of Spark 4.0 is the Variant data type, which is a faster way of processing JSON data (see here). Sample data: For this post I used a copy of the sample data from Adobe (https://opensource.adobe.com/Spry/samples/data_region/JSONDataSetSample.html).

Delta Lake from .NET in a Spark Connect gRPC world

UPDATE - I have implemented delta in spark-connect-dotnet. What to do? At some point we will want to do something with Delta Lake, so I wanted to explore the options. Before we do that, there is a little explaining to do about Delta Lake and Spark. There are two completely separate sides to this: the first is getting Spark to read and write in delta format, and the second is performing operations on the actual files directly without using Spark, operations like Vacuum etc.

Apache Spark from PHP - it is not just a .NET thing

Goal of this post: I wanted to explore what the Spark Connect API looks like from other languages. I am not a PHP developer; I used it a long time ago and have read up on some of the modern changes, but apologies if I insult any PHP-ers! I will say that I quite like PHP. Setup: The instructions are from https://grpc.io/docs/languages/php/quickstart/ with an extra step.

Implementing functions and more fun in Spark Connect using gRPC and .NET

Goal of this post: The goal of this post is to look at creating a SparkSession and a DataFrame that will wrap the Range relation, then use the WithColumn function to add a column to the DataFrame and the Show function to display it. We won’t have a builder but we are moving towards: var spark = SparkSession .

Moving towards the DataFrame API using the Spark Connect gRPC API in .NET

Goal of this post: There are two goals for this post. The first is to take a look at Apache Arrow and how we can do things like show the output from DataFrame.Show; the second is to start to create objects that look more familiar to us, i.e. the DataFrame API. I want to take it in small steps. I 100% know that this sort of syntax is possible:

Exploring the Spark Connect gRPC API more

Goal of this post: In this post we will continue looking at the gRPC API and the AnalyzePlan method, which takes a plan and analyzes it. To be honest, I expected this to be longer but decided just to do the AnalyzePlan method. There are a few more APIs like ReleaseExecute, InterruptAsync, and ReattachExecute that I was going to cover, but I changed my mind, so consider this part of the last post :).

Exploring the Spark Connect gRPC API

Goal of this post: In the first two posts, we looked at how to run some Spark code, firstly against a local Spark Connect server and then against a Databricks cluster. In this post, we will look more at the actual gRPC API itself, namely ExecutePlan, Config, and AddArtifacts/ArtifactsStatus. SparkConnectService.SparkConnectServiceClient: The way we call the API is using the SparkConnectServiceClient. When we take the .proto files from the Spark repo and add them to a Visual Studio project with the Protobuf build action that comes from the Grpc.