"Testing, testing, testing, to get started with automated ETL (ELT) testing have a look here: https://the.agilesql.club/etl-testing/
I really like the new dotnet driver for Spark because I think it makes Spark more accessible to devs who might not know Python or Scala.
If you want to build and run a dotnet application locally using the dotnet driver, you will need:
- a Java runtime (JRE)
- Spark
- dotnet sdk
Now the JRE has some docker images and so does the dotnet sdk, but what if you want both in a single container? The simplest way to get this set up is to take both the dotnet core sdk image and the JRE image and create a multi-stage dockerfile. For a working example see:
The interesting bits are at the beginning:
FROM mcr.microsoft.com/dotnet/core/sdk:2.2 as core
FROM openjdk:8u212-jre
ADD ./spark-2.4.3-bin-hadoop2.7.tgz /usr/local/
COPY --from=core /usr/share/dotnet /usr/share/dotnet
CMD "ln -s /usr/share/dotnet/dotnet"
This is quite new syntax in docker and you need at least docker 17.05 (client and daemon). After the image in “FROM blah” you can specify a name, “core” in this case, then later you can copy files from the first image into the second using “--from=” on the “COPY” command.
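Since multi-stage builds need docker 17.05 or later, it can be worth checking the installed version before building. The sketch below is one way to compare version strings in a shell. `version_at_least` is a hypothetical helper name, and it assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available.
version_at_least() {
    # Sort the two versions and take the lowest; if the lowest is the
    # required version, the installed version is equal to it or newer.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}
```

For example, `version_at_least "$(docker version --format '{{.Server.Version}}')" 17.05 || echo "daemon too old for multi-stage builds"` checks the daemon, and swapping in `{{.Client.Version}}` checks the client.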
In this dockerfile I have added Spark 2.4.3 and the default environment variables we need to get spark running. If you grab this dockerfile and run “docker build -t dotnet-spark .” you should get an image that includes the dependencies for dotnet as well as spark.
Once you have run “docker build -t dotnet-spark .” to build the image, you can start a container from the image by running “docker run -it dotnet-spark bash”.
You can test that spark works by running:
spark-shell
which should give you a nifty spark shell. You can quit that by typing:
:q
and then test dotnet by running
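As a quick sanity check inside the container, a sketch like the following confirms that the tools are on the PATH. `have_tool` and `check_tools` are hypothetical helper names, not part of the original setup.

```shell
# Hypothetical helper: succeeds when the named command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report on every tool passed in; returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

If the image built correctly, `check_tools spark-shell spark-submit dotnet` should report all three as found, and `dotnet --info` then prints the details of the installed sdk.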
Let's run an example project to see this all work together. Clone this project, https://github.com/GoEddie/dotnet-spark-article, inside the running container:
cd ~
git clone https://github.com/GoEddie/dotnet-spark-article.git
For this demo we will use the latest house price paid data from the UK, so download the latest data into the running docker container:
cd ~
wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-monthly-update-new-version.csv
The next thing is to download the nuget packages that the demo solution uses. We want to restore them to a known path because we need to reference the jar files inside the Microsoft.Spark package:
cd ~/dotnet-spark-article/HousePrices-Core
dotnet restore --packages ./packages
You should now have the spark driver jar in: “~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars”
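To confirm the restore put the jar where it is expected, a small sketch like this lists the driver jars under the packages folder. `list_driver_jars` is a hypothetical helper, and the `microsoft-spark-*.jar` pattern is an assumption based on the jar name used in the spark-submit command in this post.

```shell
# Hypothetical helper: print any Microsoft.Spark driver jars found under
# the given nuget packages folder (prints nothing if there are none).
list_driver_jars() {
    packages_dir=$1
    find "$packages_dir/microsoft.spark" -type f -name 'microsoft-spark-*.jar' 2>/dev/null | sort
}
```

For example, `list_driver_jars ~/dotnet-spark-article/HousePrices-Core/packages` should print at least one jar path once the restore has finished.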
Let's make sure we can build our demo app:
So what have we got?
We should have a version 8 JRE, the dotnet 2.2 sdk, a demo project, the dotnet spark driver (from the Microsoft.Spark nuget package) and a csv file with a load of house prices from the UK.
Let’s submit our dotnet app to spark and get a result:
spark-submit --class org.apache.spark.deploy.DotnetRunner --master local ~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar dotnet run ~/pp-monthly-update-new-version.csv
What this does is start spark using spark-submit and tell spark to start the dotnet driver with “--class org.apache.spark.deploy.DotnetRunner”. We use a local instance of spark, “--master local”, then we tell spark where to find our driver, “~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar”, and what we want the driver to launch, “dotnet run”. Finally we pass the path to the csv to our app as the first argument, “~/pp-monthly-update-new-version.csv”.
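The same command can be wrapped in a small function so that each piece is named. This is only a sketch: `submit_house_prices` and the `SPARK_SUBMIT` override are hypothetical, while the class name, master and jar path are the ones used above.

```shell
# Hypothetical wrapper around the spark-submit command shown above.
submit_house_prices() {
    project_dir=$1   # folder that contains the dotnet project and ./packages
    csv_path=$2      # passed to the dotnet app as its first argument
    driver_jar="$project_dir/packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar"

    # Setting SPARK_SUBMIT=echo prints the arguments instead of launching spark.
    "${SPARK_SUBMIT:-spark-submit}" \
        --class org.apache.spark.deploy.DotnetRunner \
        --master local \
        "$driver_jar" \
        dotnet run "$csv_path"
}
```

Call it from the project folder, for example `submit_house_prices ~/dotnet-spark-article/HousePrices-Core ~/pp-monthly-update-new-version.csv`, because `dotnet run` looks for a project in the current directory.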
Hopefully, this outputs the average house prices over the last few years. If you want to see this work on a larger dataset then, instead of last month's data, you can get the historical data by visiting https://data.gov.uk/dataset/4c9b7641-cf73-4fd9-869a-4bfeed6d440e/hm-land-registry-price-paid-data