How do you debug your spark-dotnet app in visual studio?

When you run an application using spark-dotnet, you launch it with spark-submit, which starts a Java virtual machine, which starts the spark-dotnet driver, which then runs your program. That leaves us with a problem: how do we write our programs in Visual Studio and press F5 to debug? There are two approaches; one I have used for years with dotnet when I want to debug something that is challenging to get a debugger attached to - think apps which spawn other processes and fail in their startup routine.
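The post goes on to describe the approaches in detail; purely as a rough sketch of the classic "wait for a debugger" trick (the 500ms poll interval and the message are my own assumptions, not code from the post), the start of the driver program could look like this:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class Program
    {
        static void Main(string[] args)
        {
            // Pause here so Visual Studio can be attached to the dotnet process
            // that spark-submit has just started (Debug > Attach to Process...).
            Console.WriteLine($"Waiting for a debugger, process id: {Process.GetCurrentProcess().Id}");
            while (!Debugger.IsAttached)
            {
                Thread.Sleep(500);
            }
            Debugger.Break();

            // ...the rest of the spark-dotnet program carries on as normal...
        }
    }

Once a debugger is attached, the loop exits and you can step through the rest of the program as if you had pressed F5.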

How do we prove our ETL processes are correct? How do we make sure upstream changes don't break our processes and break our beautiful data?

ETL Testing Part 2 - Operational Data Testing. This is the second part of a series on ETL testing; the first part covered unit testing, and in this part we will talk about how we can prove the correctness of the actual data, both today and in the future, after every ETL run. Testing ETL processes is a multi-layered beast: we need to understand the different types of test, what they do for us, and how to actually implement them.
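To give a flavour of what an operational data test can look like (this is my own minimal sketch rather than code from the post, and the file paths and formats are hypothetical), a check that runs after every ETL load might be as simple as reconciling row counts between source and target:

    using System;
    using Microsoft.Spark.Sql;

    class RowCountReconciliation
    {
        static void Main()
        {
            var spark = SparkSession.Builder().GetOrCreate();

            // Hypothetical locations - substitute the real source and target of your ETL run.
            long sourceCount = spark.Read().Option("header", true).Csv("/data/source/orders.csv").Count();
            long targetCount = spark.Read().Parquet("/data/warehouse/orders").Count();

            if (sourceCount != targetCount)
            {
                throw new Exception(
                    $"Operational test failed: source has {sourceCount} rows but target has {targetCount}");
            }
        }
    }

A check like this doesn't prove every value is correct, but run after each load it catches the class of failure where an upstream change silently drops or duplicates rows.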

spark-dotnet how to manually break a file into rows and columns

I found a question on Stack Overflow that went something like this: “I have a file that includes line endings in the wrong place and I need to parse the text manually into rows” (https://stackoverflow.com/questions/57294619/read-a-textfile-of-fixed-length-with-newline-as-one-of-attribute-value-into-a-ja/57317527). I thought it would be interesting to implement this with what we have available today in spark-dotnet. The thing is, though, that even though this is possible in spark-dotnet (or the other versions of Spark), I would pre-process the file in something else, so that by the time Spark reads the file it is already in a suitable format.
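As a sketch of that pre-processing approach (the 20-character record width and the file names are hypothetical, just to illustrate the idea), you could strip the stray line endings and re-chunk the text in plain dotnet before Spark ever sees the file:

    using System.IO;
    using System.Linq;

    class Preprocess
    {
        static void Main()
        {
            const int recordLength = 20; // hypothetical fixed record width

            // Remove the line endings that landed in the wrong places.
            var text = File.ReadAllText("people_raw.txt")
                           .Replace("\r", "")
                           .Replace("\n", "");

            // Cut the remaining text into fixed-length records.
            var records = Enumerable
                .Range(0, text.Length / recordLength)
                .Select(i => text.Substring(i * recordLength, recordLength));

            // One clean record per line - Spark can now read it as an ordinary text file.
            File.WriteAllLines("people_clean.txt", records);
        }
    }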

java.lang.ClassNotFoundException: org.apache.spark.deploy.DotnetRunner with 0.4.0 of spark-dotnet

There was a breaking change in version 0.4.0 that changed the name of the class that is used to load the dotnet driver in Apache Spark. To fix the issue you need to use the new class name, which has an extra dotnet package segment near the end. Change:

    spark-submit --class org.apache.spark.deploy.DotnetRunner

into:

    spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner

What if I have this error but that doesn’t fix it? When you run a spark app using spark-submit and you get a ClassNotFoundException for the driver, it boils down to either a typo or something on your system blocking the jar from being loaded (anti-virus?

How do we test ETL pipelines? Part one unit tests

Why do we bother testing? Testing isn’t an easy thing to define. We all know we should do it; when something goes wrong in production, people shout and ask where the tests were - hell, even auditors like to see evidence of tests (whether or not the tests are any good isn’t generally part of an audit). What do we test, and how and why do we even write tests? It is all well and good saying “write unit tests and integration tests”, but what do we test?

spark-dotnet examples - reading and writing csv files

How do you read and write CSV files using the dotnet driver for Apache Spark? I have a runnable example here: https://github.com/GoEddie/dotnet-spark-examples Specifically: https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv Let’s take a walkthrough of the demo:

    Console.WriteLine("Hello Spark!");

    var spark = SparkSession
        .Builder()
        .GetOrCreate();

We start with the obligatory “Hello World!”, then we create a new SparkSession.

    //Read a single CSV file
    var source = spark
        .Read()
        .Option("header", true)
        .Option("inferSchema", true)
        .Option("ignoreLeadingWhiteSpace", true)
        .Option("ignoreTrailingWhiteSpace", true)
        .
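The excerpt is cut off before the read finishes; continuing from the spark session created above, and purely as an illustrative sketch (the paths and the rest of the option chain are my guesses, not the repo’s code), completing the read and adding the write side would look something like this:

    // Complete the read into a DataFrame (the path is a placeholder).
    var source = spark
        .Read()
        .Option("header", true)
        .Option("inferSchema", true)
        .Csv("./source.csv");

    source.Show();

    // Write the DataFrame back out as CSV with a header row.
    source
        .Write()
        .Mode("overwrite")
        .Option("header", true)
        .Csv("./output-csv");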

spark-dotnet how does user .net code run spark code in a java vm

Apache Spark is written in Scala; Scala compiles to Java bytecode and runs inside a Java virtual machine. The spark-dotnet driver runs dotnet code that calls Spark functionality, so how does that work? There are two paths to running dotnet code with Spark: the first is the general case, which I will describe here; the second is UDFs, which I will explain in a later post as it is slightly more involved.
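To make the general case concrete, here is a minimal sketch of user dotnet code; every call below is a thin proxy that the spark-dotnet driver forwards to the JVM, where the real Spark objects live (the comments reflect my understanding of the mechanism rather than text from the post):

    using Microsoft.Spark.Sql;

    class Program
    {
        static void Main()
        {
            // Creating the SparkSession asks the JVM side (started by DotnetRunner
            // via spark-submit) to create the real Scala SparkSession; the C# object
            // only holds a reference to it.
            var spark = SparkSession.Builder().AppName("proxy-demo").GetOrCreate();

            // Each method call is serialised over to the JVM, executed there, and the
            // resulting JVM object reference is handed back to dotnet.
            var df = spark.Range(0, 10);
            df.Show();
        }
    }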

Spark and dotnet in a single docker container

I really like the new dotnet driver for Spark because I think it makes Spark more accessible to devs who might not know python or scala. If you want to be able to build and run a dotnet application using the dotnet driver locally, you will need: the JRE, Spark, and the dotnet sdk. Now, the JRE has some docker images and so does the dotnet sdk, but what if you want both in a single container?

Where do you do your database development? Hopefully not production

Here are three scenarios. If you work with SQL Server, either as a provider of database environments (DBA) or as a consumer of database environments for your application (developer), then you will likely see yourself in one of these descriptions. If you don’t, please 100% find some way to tell me (email, comment below, etc.). Prod FTW’ers - There is only one place where the developers can develop. DBAs (if you have one) complain every so often about someone using “sa” on production, and whoever is using the “sa” account keeps leaving open transactions in SSMS, blocking all the users.

As a DBA, how do I offload some of my work?

I don’t have time for this: “I am a DBA, I am busy, too busy. Developers keep pushing changes to production without me reviewing the code, and now I am stuck again over the weekend fixing performance issues while the developers are on the beach with a piña colada and a cigar.” Sound familiar? Maybe the developers aren’t on the beach drinking and smoking, but the sentiment is the same: