Goal of this post: I wanted to explore what the Spark Connect API looks like from other languages. I am not a PHP developer; I used it a long time ago and have read up on some of the modern changes, so apologies if I insult any PHP-ers! I will say that I quite like PHP. Setup: The instructions are from https://grpc.io/docs/languages/php/quickstart/, with an extra step.
The goal of this post is to look at creating a SparkSession and a DataFrame that wraps the Range relation; then we will use the WithColumn function to add a column to the DataFrame and the Show function to display it. We won’t have a builder yet, but we are moving towards: var spark = SparkSession .
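To make that direction concrete, here is a purely hypothetical sketch of the builder-style C# surface the series is working towards; SparkSession.Builder, Remote, GetOrCreate, Range, WithColumn, and Functions.Col are illustrative stand-ins rather than an existing API.

// Hypothetical sketch only: the fluent surface these posts are building towards.
// None of these wrapper types exist yet; they stand in for what is being built.
var spark = SparkSession
    .Builder
    .Remote("http://localhost:15002")
    .GetOrCreate();

var dataFrame = spark
    .Range(100)
    .WithColumn("id_times_two", Functions.Col("id") * 2);

dataFrame.Show();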
There are two goals for this post: the first is to take a look at Apache Arrow and how we can do things like show the output from DataFrame.Show; the second is to start creating objects that look more familiar to us, i.e. the DataFrame API. I want to take it in small steps; I know 100% that this sort of syntax is possible:
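On the Apache Arrow side of that first goal, here is a minimal sketch, using the Apache.Arrow NuGet package, of reading Arrow IPC record batches like the ones that come back when a plan is executed; the arrowBatchBytes parameter is an assumed input holding raw batch bytes already pulled from the gRPC responses.

using System;
using System.IO;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

public static class ArrowBatchPrinter
{
    // arrowBatchBytes is assumed to already contain the Arrow IPC stream taken
    // from the execute responses; obtaining those bytes is what the post covers.
    public static async Task PrintBatchesAsync(byte[] arrowBatchBytes)
    {
        using var stream = new MemoryStream(arrowBatchBytes);
        using var reader = new ArrowStreamReader(stream);

        RecordBatch batch;
        while ((batch = await reader.ReadNextRecordBatchAsync()) != null)
        {
            // Each record batch carries a chunk of rows plus the schema.
            Console.WriteLine($"Columns: {batch.ColumnCount}, Rows: {batch.Length}");
        }
    }
}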
In this post, we will continue looking at the gRPC API and the AnalyzePlan method, which takes a plan and analyzes it. To be honest, I expected this one to be longer but decided just to do the AnalyzePlan method. There are a few more APIs, like ReleaseExecute, InterruptAsync, and ReattachExecute, that I was going to cover but changed my mind, so consider this part of the last post :).
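As a rough sketch of what an AnalyzePlan call can look like from C#, assuming the classes generated from the Spark Connect .proto files (AnalyzePlanRequest, its nested Schema message, DataType, and the SparkConnectServiceClient) and a plan and session id created elsewhere:

using System.Threading.Tasks;
using Spark.Connect;   // assumed namespace of the generated protobuf/gRPC classes

public static class PlanAnalyzer
{
    // Sketch only: ask the Spark Connect server for the schema of a plan.
    // 'plan' would wrap a relation such as Range; 'sessionId' is any GUID string.
    public static async Task<DataType> GetSchemaAsync(
        SparkConnectService.SparkConnectServiceClient client,
        Plan plan,
        string sessionId)
    {
        var request = new AnalyzePlanRequest
        {
            SessionId = sessionId,
            Schema = new AnalyzePlanRequest.Types.Schema { Plan = plan }
        };

        // AnalyzePlan is a unary RPC, so the generated client exposes an Async variant.
        var response = await client.AnalyzePlanAsync(request);

        // The Schema result on the response carries a DataType describing the columns.
        return response.Schema.Schema;
    }
}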
In the first two posts, we looked at how to run some Spark code, first against a local Spark Connect server and then against a Databricks cluster. In this post, we will look more at the actual gRPC API itself, namely ExecutePlan, Config, and AddArtifacts/ArtifactsStatus. SparkConnectService.SparkConnectServiceClient: The way we call the API is using the SparkConnectServiceClient; we take the .proto files from the Spark repo and add them to a Visual Studio project with the Protobuf build action that comes from the Grpc.
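To make the client setup concrete, here is a minimal sketch that creates the SparkConnectServiceClient over a Grpc.Net.Client channel and streams the ExecutePlan responses for a simple Range plan; the localhost address and port, the Spark.Connect namespace for the generated classes, and the freshly generated session id are all assumptions for the example.

using System;
using Grpc.Core;
using Grpc.Net.Client;
using Spark.Connect;   // assumed namespace of the classes generated from the .proto files

// Sketch only: connect to a local Spark Connect endpoint and execute a Range plan.
var channel = GrpcChannel.ForAddress("http://localhost:15002");
var client = new SparkConnectService.SparkConnectServiceClient(channel);

var plan = new Plan
{
    Root = new Relation
    {
        // Fully qualified to avoid clashing with System.Range.
        Range = new Spark.Connect.Range { Start = 0, End = 5, Step = 1 }
    }
};

var request = new ExecutePlanRequest
{
    SessionId = Guid.NewGuid().ToString(),
    Plan = plan
};

// ExecutePlan is a server-streaming RPC, so we read responses until the stream ends.
using var call = client.ExecutePlan(request);
await foreach (var response in call.ResponseStream.ReadAllAsync())
{
    Console.WriteLine(response.ResponseTypeCase);
}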
This post aims to show how we can create a .NET application, deploy it to Databricks, and then run a Databricks job that calls our .NET code, which uses Spark Connect to run a Spark job on the Databricks job cluster to write some data out to Azure storage. In the previous post, I showed how to use the Range command to create a Spark DataFrame and then save it locally as a parquet file.
Introductory Ramble: Spark Connect: In July 2022, at the Data and AI Summit, Apache Spark announced “Spark Connect,” which was a way of connecting to Apache Spark using the gRPC protocol rather than the supplied Java or Python APIs. I’ve been using Spark since 2017 and have been completely obsessed with how transformative it has been in the data processing space. Me: Over the past couple of decades working in IT, I have found a particular interest in protocols.
I made a mistake recently when I was creating an ADF pipeline. Annoyingly, I made loads of changes and then clicked the debug button; when I pressed debug, the pipeline failed to start and I was presented with this little beaut of an error message: The pipeline was quite complicated, so I didn’t know exactly what was causing the failure, and I went through the usual ADF troubleshooting steps (save all, then refresh the web page), which didn’t help.
In my previous blog post, I talked about how to read from an XML web service and use XPath to query the XML on the expressions side of things. You can read the XML article here (https://the.agilesql.club/2021/02/adf-xml-objects-and-xpath-in-the-expression-language/). Now, what if we don’t have XML but have JSON? Well, what if there was a way to query JSON documents; imagine, if you will, a JSONQuery where you can pass a query similar to an XPath query to retrieve specific values from the JSON document.
When you use ADF, there are two sides to the coin. The first is the data itself, which ADF handles very well; from moving it from one site to another, to flattening JSON documents, to converting from CSV to Avro, Parquet, or SQL, it is powerful. The other side of the coin is how ADF uses data as variables to manage the pipeline, and it is this side of the coin that I wish to talk about today.