There was a breaking change in version 0.4.0 that renamed the class used to load the dotnet driver in Apache Spark. To fix the issue you need to use the new package name, which adds an extra "dotnet" near the end. Change: spark-submit --class org.apache.spark.deploy.DotnetRunner to: spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner What if I have this error but that doesn't fix it? When you run a Spark app using spark-submit and you get a ClassNotFoundException for the driver, it boils down to either a typo or something on your system blocking the jar from being loaded (anti-virus?)
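For reference, a full spark-submit call using the new class name looks something like this (the master, jar version and app dll below are illustrative, not values from the post; use whatever matches your install and project):

spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  microsoft-spark-2.4.x-0.4.0.jar \
  dotnet HelloSpark.dll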
Why do we bother testing? Testing isn't an easy thing to define. We all know we should do it; when something goes wrong in production, people shout and ask where the tests were. Hell, even auditors like to see evidence of tests (whether or not they are good isn't generally part of an audit). What do we test, and how and why do we even write tests? It is all well and good saying "write unit tests and integration tests", but what exactly do we test?
How do you read and write CSV files using the dotnet driver for Apache Spark? I have a runnable example here: https://github.com/GoEddie/dotnet-spark-examples Specifically: https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv Let's take a walkthrough of the demo:

Console.WriteLine("Hello Spark!");

var spark = SparkSession
    .Builder()
    .GetOrCreate();

We start with the obligatory "Hello World!", then we create a new SparkSession.

//Read a single CSV file
var source = spark
    .Read()
    .Option("header", true)
    .Option("inferSchema", true)
    .Option("ignoreLeadingWhiteSpace", true)
    .Option("ignoreTrailingWhiteSpace", true)
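The excerpt stops mid-chain; as a minimal sketch of how the read can be completed and the data written back out (the file paths, the Csv() call and the Write() chain here are my assumptions based on the Microsoft.Spark API, not code taken from the post):

using Microsoft.Spark.Sql;

var spark = SparkSession.Builder().GetOrCreate();

var source = spark
    .Read()
    .Option("header", true)
    .Option("inferSchema", true)
    .Csv("./source.csv");            // hypothetical input path

source.Show();                       // print a few rows to confirm the schema was inferred

source
    .Write()
    .Mode(SaveMode.Overwrite)        // overwrite any previous output
    .Option("header", true)
    .Csv("./output-csv");            // Spark writes a directory of part files, not a single file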
Apache Spark is written in Scala; Scala compiles to Java bytecode and runs inside a Java virtual machine. The spark-dotnet driver runs dotnet code and calls Spark functionality, so how does that work? There are two paths for running dotnet code with Spark: the first is the general case, which I will describe here; the second is UDFs, which I will explain in a later post as it is slightly more involved.
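To make the general case concrete, here is a minimal sketch (my own example, not code from the post). Roughly, each call you make on these objects in dotnet is forwarded to the JVM side that spark-submit started, where the real Scala Spark classes do the work; the dotnet side mostly just holds references to the JVM objects.

using Microsoft.Spark.Sql;

// Each call below runs very little Spark logic in dotnet itself; the driver
// sends the method name and arguments to the JVM backend and gets back a
// reference to the resulting JVM object.
var spark = SparkSession
    .Builder()
    .AppName("how-does-it-work")   // hypothetical app name
    .GetOrCreate();

var df = spark.Range(0, 100);      // 'df' is a dotnet proxy around a JVM DataFrame
df.Show();                         // executed by Spark inside the JVM; output appears in the console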
I really like the new dotnet driver for Spark because I think it makes Spark more accessible to devs who might not know Python or Scala. If you want to be able to build and run a dotnet application using the dotnet driver locally you will need: a JRE, Spark, and the dotnet sdk. Now the JRE has some docker images and so does the dotnet sdk, but what if you want both in a single container?
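As a rough sketch of what that single image could look like (this Dockerfile is my assumption, not from the post; the base image tag, the Java package and the Spark version and paths are all illustrative and may need adjusting):

FROM mcr.microsoft.com/dotnet/core/sdk:2.2

# Add a Java runtime next to the dotnet sdk (Spark 2.4 expects Java 8)
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-8-jre-headless curl && \
    rm -rf /var/lib/apt/lists/*

# Download and unpack Apache Spark (version and mirror are illustrative)
RUN curl -sL https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz \
    | tar -xz -C /opt

ENV SPARK_HOME=/opt/spark-2.4.3-bin-hadoop2.7
ENV PATH="${SPARK_HOME}/bin:${PATH}"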
Here are three scenarios. If you work with SQL Server, either as a provider of database environments (a DBA) or as a consumer of database environments for your application (a developer), then you will likely see yourself in one of these descriptions. If you don't, please 100% find some way to tell me (email, comment below, etc.). Prod FTW'ers – there is only one place where the developers can develop. DBAs (if you have any) complain every so often about someone using "sa" on production, and whoever is using the "sa" account keeps leaving open transactions in SSMS, blocking all the users.
I don't have time for this: "I am a DBA. I am busy, too busy. Developers keep pushing changes to production without me reviewing the code, and now I am stuck again over the weekend, fixing performance issues while the developers are on the beach with a piña colada and a cigar." Sound familiar? Maybe the developers aren't on the beach drinking and smoking, but the sentiment is the same:
In my blog post here https://the.agilesql.club/2019/06/what-steps-are-there-to-move-to-safe-automated-database-deployments/ I described the steps you need to go through to build up your confidence that you are capable of deploying databases using automation. I mean, after all, knowing that it is possible to automate your deployments and having confidence that they will succeed are two very different things. Even with the best tooling in the world, automated database deployments are still a struggle, and there is one key thing you can do, no matter what tools you choose: make the deployments re-runnable, that is, a deployment that has already run, or that failed partway through, can safely be run again.
Database deployments are scary: you have all this data, and if you drop the wrong table, run the wrong delete statement, or have an error in a stored procedure that forgets to write that critical piece of data, then you may never truly recover. You may well have backups, but what if your backup is corrupt? What if your stored procedure hasn't been writing the right data for a week?
Do I need an incrementing identity int/bigint as my clustered index in a SQL Server database? When you want to produce a professional table design that will scale in the future and stop you from being called at 4 AM to fix a performance issue, will you regret the decision not to add an incrementing "id" column to that core table? When you look in forums, do you see people fiercely guarding their opinion that you should/should not include an id column?