# Spark Connect Dotnet First Major Milestone


When I wrote the spark-connect-dotnet lib I didn't envisage implementing every function; instead, the plan was to implement the most common functionality and show people how to make their own gRPC calls for everything else. What I found was that, once I had figured out the shared plumbing, implementing each function was actually pretty easy, and by implementing all of them I was able to get the supporting pieces, like collecting data back through Arrow, working as well.

The library now implements all of the functions that PySpark implements under [pyspark.sql.functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions).

## What next?

### Apache Spark 4.0

I'll likely work on this next. I've already tested against Spark 4.0 and all of the tests pass, which is good. There is the new Variant type, which should be interesting to implement, as well as the new 4.0 functions.
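
Until the Variant helpers exist as first-class functions in the library, they can already be exercised through plain SQL. This is only a minimal sketch: the session-builder call and namespace are my assumptions about the library's API, while `parse_json` and `variant_get` are the Spark 4.0 SQL functions.

```csharp
// Sketch only: namespace and session builder are assumed, adjust to the real API.
using Spark.Connect.Dotnet.Sql;

var spark = SparkSession.Builder
    .Remote("sc://localhost:15002")   // assumed Spark Connect endpoint
    .GetOrCreate();

// In Spark 4.0, parse_json produces a VARIANT and variant_get extracts typed values from it.
var df = spark.Sql(
    "SELECT variant_get(parse_json('{\"a\": 1, \"b\": [2, 3]}'), '$.a', 'int') AS a");

df.Show();
```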

### Arrow Support

There are some gaps in the Arrow support, particularly around building complex types like structs and arrays in CreateDataFrame. I suspect these are used more in examples than in real-world scenarios, and they can be worked around by using `spark.Sql("array('a', 'b') array")` and the like.
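
As a rough illustration of that workaround (again assuming the session setup above; only `Sql` and `Show` are mentioned in this post, the rest is a sketch), the complex columns can be built server-side in SQL rather than through CreateDataFrame:

```csharp
// Sketch only: build structs and arrays in SQL instead of via CreateDataFrame.
using Spark.Connect.Dotnet.Sql;   // assumed namespace

var spark = SparkSession.Builder
    .Remote("sc://localhost:15002")   // assumed session setup
    .GetOrCreate();

// An array column and a struct column constructed entirely on the server side,
// so no complex Arrow types need to be encoded on the client.
var df = spark.Sql(
    "SELECT array('a', 'b') AS letters, named_struct('id', 1, 'name', 'spark') AS info");

df.Show();
```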

### UDFs

I also need to figure out what to do about UDFs: whether to expose a way to call Python UDFs, or whether we even need UDFs at all. There are a lot of new functions in Spark 3.5.0 and 4.0, which means people will need UDFs less and less, but I know some people will still have a need for them.
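
To make the "maybe you don't need a UDF" point concrete, here is a sketch of a per-element transformation that would classically have been a Python UDF, done instead with the built-in higher-order functions; the session setup is assumed as in the earlier sketches:

```csharp
// Sketch only: higher-order functions instead of a UDF.
using Spark.Connect.Dotnet.Sql;   // assumed namespace

var spark = SparkSession.Builder
    .Remote("sc://localhost:15002")   // assumed session setup
    .GetOrCreate();

// transform + a lambda runs entirely inside Spark, no UDF round trip to a
// Python (or .NET) worker process is needed.
var df = spark.Sql(
    "SELECT transform(array('a', 'b', 'c'), x -> upper(x)) AS shouting");

df.Show();
```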

### ML

Looking at the ML support in PySpark Connect, it looks like PySpark is using Torch and friends, so I'm not sure how to implement this yet.

## Lessons Learned so far

Implementing the Spark functions via Spark Connect has been interesting; there are a few things that I have learned:

- Implementing functions manually is slow; I'd prefer to spend my time gathering metadata and generating the functions
- The PySpark docs include examples of every function; if these could be used to write tests, that would be amazing
- I've implemented the functions using .NET naming standards; perhaps I should have used the Scala/PySpark naming standard
- I need to work on how the Arrow data is encoded and decoded; it is so critical and I don't think I have it quite right
- C# needs some more syntactic sugar: the PySpark examples often call createDataFrame and pass in an array, and creating arrays in C# is harder than in Python, so I keep playing around with different options (see the sketch after this list)
- I should have done a better job of copying the documentation into the function headers; in the first pass I skipped it in the hope that users would look at the PySpark docs, but it would be better in the .NET code so I can generate docs from it
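
On the syntactic sugar point, C# 12 collection expressions get reasonably close to Python's literal lists. The sketch below is only an idea of what that could look like; the `CreateDataFrame` overload shown (rows as tuples plus column names) is hypothetical, not the library's actual signature:

```csharp
// Sketch only: hypothetical CreateDataFrame overload, assumed namespace and session setup.
using Spark.Connect.Dotnet.Sql;

var spark = SparkSession.Builder
    .Remote("sc://localhost:15002")   // assumed session setup
    .GetOrCreate();

// Python: spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "number"])
// C# 12 collection expressions make the row data almost as terse:
(string, int)[] rows = [("a", 1), ("b", 2)];

// Hypothetical overload taking rows and column names; the real API may differ.
var df = spark.CreateDataFrame(rows, ["letter", "number"]);

df.Show();
```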