spark-dotnet how to manually break a file into rows and columns

I found this question on stack overflow that went something like this:

“I have a file that includes line endings in the wrong place and I need to parse the text manually into rows” (https://stackoverflow.com/questions/57294619/read-a-textfile-of-fixed-length-with-newline-as-one-of-attribute-value-into-a-ja/57317527).

I thought it would be interesting to implement this with what we have available today in spark-dotnet. The thing is, though, that even though this is possible in spark-dotnet (or any of the other versions of Spark), I would pre-process the file in something else so that, by the time Spark reads it, the file is already in a suitable format. On that note, let’s look at the problem.
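
To give a flavour of what parsing the text manually looks like in spark-dotnet, here is a rough sketch (the file name, the 20-character record length and the two 10-character columns are made up for illustration; the real question has its own fixed widths):

            //Rough sketch - assumes "using Microsoft.Spark.Sql;" at the top of the file
            var spark = SparkSession
                .Builder()
                .GetOrCreate();

            //Read the whole file as a single value so the stray line endings don't matter
            var raw = spark
                            .Read()
                            .Option("wholetext", true)
                            .Text("./broken.txt");

            //Strip the line endings, chop the text into 20 character records
            //(a Java regex trick: split after every 20th character), then explode
            //the array so we get one row per record
            var records = raw
                            .WithColumn("clean", Functions.RegexpReplace(Functions.Col("value"), "[\\r\\n]", ""))
                            .WithColumn("records", Functions.Split(Functions.Col("clean"), "(?<=\\G.{20})"))
                            .Select(Functions.Explode(Functions.Col("records")).Alias("record"));

            //Finally carve each record into two fixed-width columns
            var parsed = records.Select(
                            Functions.Substring(Functions.Col("record"), 1, 10).Alias("name"),
                            Functions.Substring(Functions.Col("record"), 11, 10).Alias("value"));

            parsed.Show();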

java.lang.ClassNotFoundException: org.apache.spark.deploy.DotnetRunner with 0.4.0 of spark-dotnet

There was a breaking change with version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark.

To fix the issue you need to use the new class name, which adds an extra “dotnet” package near the end; change:

spark-submit --class org.apache.spark.deploy.DotnetRunner

into:

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner
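
For reference, a full spark-submit then looks something like this (the jar name depends on the Spark and spark-dotnet versions you installed, and MySparkApp.dll is a placeholder for your own app):

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master local microsoft-spark-2.4.x-0.4.0.jar dotnet MySparkApp.dll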

What if I have this error but that doesn’t fix it?

When you run a Spark app using spark-submit and you get a ClassNotFoundException for the driver, it boils down to either a typo or something on your system blocking the jar from being loaded (anti-virus?).

How do we test ETL pipelines? Part one unit tests

Why do we bother testing?

Testing isn’t an easy thing to define. We all know we should do it; when something goes wrong in production people shout and ask where the tests were, and hell, even auditors like to see evidence of tests (whether or not they are good isn’t generally part of an audit). What do we test, and how and why do we even write tests? It is all well and good saying “write unit tests and integration tests”, but what do we test? We are writing ETL pipelines, and the people who run the source system can’t even tell us whether the files are in CSV, JSON, or double-dutch – they certainly don’t have a schema that persists more than ten minutes.

spark-dotnet examples - reading and writing csv files

How do you read and write CSV files using the dotnet driver for Apache Spark?

I have a runnable example here:

https://github.com/GoEddie/dotnet-spark-examples

Specifically:

https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv

Let’s walk through the demo:

            Console.WriteLine("Hello Spark!");

            var spark = SparkSession
                .Builder()
                .GetOrCreate();

We start with the obligatory “Hello Spark!”, then we create a new SparkSession.


            //Read a single CSV file
            var source = spark
                            .Read()
                            .Option("header", true)
                            .Option("inferSchema", true)
                            .Option("ignoreLeadingWhiteSpace", true)
                            .Option("ignoreTrailingWhiteSpace", true)
                            .Csv("./source.csv");

Here we read a CSV file. The first line of the file contains the column names, so we set the option “header” to true. We also want Spark to take a guess at the schema, so we set the option “inferSchema”. Because the CSV has been written with lots of whitespace, like “col1, col2”, if we didn’t ignore leading whitespace our second column would be called " col2", which isn’t ideal.
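
The post title also mentions writing, so here is a minimal sketch of the write side (the output path and the overwrite mode are just for illustration):

            //Write the DataFrame back out as CSV, with a header row
            source
                .Write()
                .Mode("overwrite")
                .Option("header", true)
                .Csv("./output-csv");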

spark-dotnet how does user .net code run spark code in a java vm

Apache Spark is written in Scala; Scala compiles to Java bytecode and runs inside a Java virtual machine. The spark-dotnet driver runs dotnet code and calls Spark functionality, so how does that work?

There are two paths to run dotnet code with Spark. The first is the general case, which I will describe here; the second is UDFs, which I will explain in a later post as it is slightly more involved.
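
Very roughly, the general case looks like this from the C# side (“people.json” is a made-up file name); the C# objects are thin proxies, and each call is forwarded to the JVM, which owns the real Spark objects:

            //Each of these C# objects just holds a reference to an object living
            //in the JVM; the method calls are serialised and sent to the JVM,
            //which runs the real Scala code and sends back the result (or a
            //reference to the new JVM-side object)
            var spark = SparkSession
                .Builder()
                .GetOrCreate();                   //the SparkSession is created in the JVM

            var people = spark
                            .Read()
                            .Json("people.json"); //the JVM reads the file, C# gets a handle

            people.Show();                        //the JVM collects and prints the rows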

Spark and dotnet in a single docker container

I really like the new dotnet driver for Spark because I think it makes Spark more accessible to devs who might not know Python or Scala.

If you want to be able to build and run a dotnet application that uses the dotnet driver locally, you will need:

  • jre
  • Spark
  • dotnet sdk

Now the JRE has some docker images and so does the dotnet sdk, but what if you want both in a single container? The simplest way to get this setup is to use both the dotnet core SDK image and a JRE image and create a multi-stage Dockerfile. For a working example see:
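
The rough shape of such a multi-stage Dockerfile is something like this (the image tags, the Spark version and the paths are my assumptions for illustration rather than what the working example uses):

    # Stage 1: the dotnet SDK image, used only as a source to copy the SDK from
    FROM mcr.microsoft.com/dotnet/core/sdk:2.2 AS dotnet-sdk

    # Stage 2: start from a JRE image and copy the dotnet SDK into it
    FROM openjdk:8-jre
    COPY --from=dotnet-sdk /usr/share/dotnet /usr/share/dotnet
    RUN ln -s /usr/share/dotnet/dotnet /usr/bin/dotnet \
        && apt-get update \
        && apt-get install -y --no-install-recommends libicu-dev \
        && rm -rf /var/lib/apt/lists/*
    # (libicu is a native dependency of the dotnet runtime that the JRE image lacks)

    # Add Apache Spark itself (the version is just an example)
    ADD https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz /opt/
    RUN cd /opt \
        && tar -xzf spark-2.4.3-bin-hadoop2.7.tgz \
        && mv spark-2.4.3-bin-hadoop2.7 spark \
        && rm spark-2.4.3-bin-hadoop2.7.tgz

    ENV SPARK_HOME=/opt/spark
    ENV PATH="${SPARK_HOME}/bin:${PATH}"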

Where do you do your database development? Hopefully not production

Here are three scenarios. If you work with SQL Server, either as a provider of database environments (DBA) or as a consumer of database environments for your application (developer), then you will likely see yourself in one of these descriptions. If you don’t, please 100% find some way to tell me (email, comment below, etc.).

Prod FTW’ers – There is only one place where the developers can develop. DBAs (if you have one) complain every so often about someone using “sa” on production; whoever is using the “sa” account keeps leaving open transactions in SSMS, blocking all the users. Sometimes developers delete the wrong thing; it happens less now that they generally use a tool that warns if the where clause is missing from an update or a delete. Hopefully, not too many identify with this scenario today (past sins have already been forgiven).

As a DBA, how do I offload some of my work?

I don’t have time for this

“I am a DBA, I am busy, too busy. Developers keep pushing changes to production without me reviewing the code, and now I am stuck again over the weekend, fixing performance issues while the developers are on the beach with a pina colada and a cigar.” Sound familiar? Maybe the developers aren’t on the beach drinking and smoking, but the sentiment is the same:

What is the key to automated database deployments?

In my blog post here https://the.agilesql.club/2019/06/what-steps-are-there-to-move-to-safe-automated-database-deployments/ I described the steps you need to go through to build up your confidence that you are capable of deploying databases using automation. I mean, after all, knowing that it is possible to automate your deployments and having confidence that they will succeed are two very different things.

Even with the best tooling in the world, automated database deployments are still a struggle, and there is one key thing that you can do, no matter what tools you choose: make the deployments re-runnable. (Insert discussion here on the word idempotent and how it means re-runnable but sounds far cooler and more intellectual.) If you make your deployments re-runnable then you can, by their very definition, re-run them.
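
As a trivial illustration of what re-runnable means in practice (the table and column names are made up), a deployment script written like this gives the same end result whether you run it once or fifty times:

    -- Only add the column if it is not already there, so the script can be re-run safely
    IF COL_LENGTH('dbo.Orders', 'DeliveryDate') IS NULL
    BEGIN
        ALTER TABLE dbo.Orders ADD DeliveryDate datetime2 NULL;
    END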

What steps are there to move to safe automated database deployments?

Database deployments are scary: you have all this data, and if you drop the wrong table, run the wrong delete statement, or have an error in a stored procedure that forgets to write that critical piece of data, then you may never truly recover. You may well have backups, but what if your backup is corrupt? What if your stored procedure hasn’t been writing the right data for a week? Will you be able to recover that data?