
Spark and dotnet in a single docker container

I really like the new dotnet driver for Spark because I think it makes Spark more accessible to devs who might not know Python or Scala.

To build and run a dotnet application locally using the dotnet driver, you will need:

  • jre
  • Spark
  • dotnet sdk

The JRE has Docker images and so does the dotnet SDK, but what if you want both in a single container? The simplest way to get this set up is to use both the dotnet core image and the JRE image in a multi-stage Dockerfile. For a working example see:

https://github.com/GoEddie/docker-dotnet-spark/blob/master/Dockerfile

The interesting bits are at the beginning:

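# Stage 1: the dotnet SDK image, named "core" so a later stage can copy from it
FROM mcr.microsoft.com/dotnet/core/sdk:2.2 as core

# Stage 2: the JRE image, which becomes the final container
FROM openjdk:8u212-jre
# ADD unpacks the local Spark tarball into /usr/local/
ADD ./spark-2.4.3-bin-hadoop2.7.tgz /usr/local/

# Copy the dotnet SDK out of the first stage
COPY --from=core /usr/share/dotnet /usr/share/dotnet
# RUN executes at build time (CMD would only set the container's default command)
RUN ln -s /usr/share/dotnet/dotnet /usr/bin/dotnet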

This is fairly new syntax in Docker and you need at least Docker 17.05 (client and daemon). After the image in “FROM blah” you can specify a name, “core” in this case, and later you can copy from the first image into the second using “--from=” on the “COPY” command.
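Because the build fails in a fairly unhelpful way on an older Docker, it can be worth checking the version first. This is only a rough sketch and not part of the linked repo: the helper name version_ge is made up for illustration, and it assumes GNU sort (for the -V flag) is available.

```shell
#!/bin/sh
# Hypothetical helper (not from the linked repo): succeeds when version $1 >= version $2.
# Assumes GNU sort, whose -V flag orders version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="17.05"

# Only query docker if it is installed, so the script is safe to run anywhere.
if command -v docker >/dev/null 2>&1; then
    for side in Client Server; do
        found="$(docker version --format "{{.${side}.Version}}" 2>/dev/null)"
        if [ -n "$found" ] && version_ge "$found" "$required"; then
            echo "docker ${side} ${found}: ok for multi-stage builds"
        else
            echo "docker ${side} ${found:-unknown}: need ${required} or later"
        fi
    done
else
    echo "docker not found on PATH"
fi
```

If the daemon is not running, the Server line reports "unknown" rather than a version.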

In this Dockerfile I have added Spark 2.4.3 and the default environment variables we need to get Spark running. If you grab this Dockerfile and run “docker build -t dotnet-spark .” you should get an image that includes the dependencies for dotnet as well as Spark.

Once the image is built, you can start a container from it by running “docker run -it dotnet-spark bash”.

You can test that Spark works by running spark-shell, which should give you a nifty Spark shell. Quit it by typing :q, then test dotnet by running dotnet --info.
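If you would rather check everything in one go, something like the following could be run inside the container. It is a minimal sketch: missing_tools is a hypothetical helper name, and it only checks that the commands are on the PATH, not that they actually work.

```shell
#!/bin/sh
# Hypothetical helper: prints each named command that is NOT on the PATH
# and fails if any are missing. The body runs in a subshell so no variables leak.
missing_tools() (
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    [ "$rc" -eq 0 ]
)

# The commands this walkthrough relies on inside the container.
if missing="$(missing_tools java spark-shell spark-submit dotnet)"; then
    echo "java, spark and dotnet are all on the PATH"
else
    # Unquoted on purpose so the names are joined onto one line.
    echo "missing from PATH:" $missing
fi
```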

Let’s run an example project to see this all work together. In the running container, clone https://github.com/GoEddie/dotnet-spark-article:

cd ~
git clone https://github.com/GoEddie/dotnet-spark-article.git 

For this demo we will use the latest house price paid data from the UK, so download the latest data into the running Docker container:

cd ~
wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-monthly-update-new-version.csv

The next thing is to download the NuGet packages that the demo solution uses. We want to restore them to a known path because we need to reference the jar files inside the Microsoft.Spark package:

cd ~/dotnet-spark-article/HousePrices-Core
dotnet restore --packages ./packages

You should now have the Spark driver jar in: “~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars”

Let’s make sure we can build our demo app:
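To confirm the jar really is there without typing the whole path, a small lookup like this may help. It is a sketch under assumptions: find_spark_jar is a made-up helper, and the Spark version is passed in explicitly in case the package ships jars for more than one Spark line.

```shell
#!/bin/sh
# Hypothetical helper: print the Microsoft.Spark jar for a given Spark line
# (e.g. 2.4.x) found under a packages directory, or nothing if there is none.
find_spark_jar() {
    [ -d "$1" ] || return 0
    find "$1" -type f -name "microsoft-spark-$2-*.jar" | sort | head -n 1
}

# The restore path used in this post.
packages_dir="$HOME/dotnet-spark-article/HousePrices-Core/packages"
jar="$(find_spark_jar "$packages_dir" 2.4.x)"

if [ -n "$jar" ]; then
    echo "driver jar: $jar"
else
    echo "no microsoft-spark 2.4.x jar under $packages_dir (has dotnet restore been run?)"
fi
```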

dotnet build

So what have we got?

We should have a version 8 JRE, the dotnet 2.2 SDK, a demo project, the dotnet Spark driver (from the Microsoft.Spark NuGet package) and a CSV file with a load of house prices from the UK.

Let’s submit our dotnet app to spark and get a result:

spark-submit --class org.apache.spark.deploy.DotnetRunner --master local ~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar  dotnet run  ~/pp-monthly-update-new-version.csv

What this does is start Spark using spark-submit, then tell Spark to start the dotnet driver with “--class org.apache.spark.deploy.DotnetRunner”. We use a local instance of Spark with “--master local”, then tell Spark where to find our driver jar, “~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar”. Finally we say what we want the driver to launch, “dotnet run”, and pass the path to the CSV to our app as the first argument, “~/pp-monthly-update-new-version.csv”.
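Since the command is long, it may be handy to wrap it so the jar and CSV paths are the only things you change. The wrapper below is just a sketch: run_demo and its --dry-run switch are invented for illustration, while the spark-submit arguments are the ones used above.

```shell
#!/bin/sh
# Hypothetical wrapper: run_demo [--dry-run] <driver-jar> <csv>
# With --dry-run it only prints the spark-submit command instead of running it.
# The body runs in a subshell so the helper variables do not leak.
run_demo() (
    dry_run=0
    if [ "$1" = "--dry-run" ]; then
        dry_run=1
        shift
    fi
    jar="$1"
    csv="$2"
    # DotnetRunner is the JVM class that hosts the dotnet driver; "dotnet run" is
    # what it launches, and the CSV path is the app's first argument.
    set -- spark-submit \
        --class org.apache.spark.deploy.DotnetRunner \
        --master local \
        "$jar" \
        dotnet run "$csv"
    if [ "$dry_run" -eq 1 ]; then
        echo "$*"
    else
        "$@"
    fi
)

# Print the command for the paths used in this post.
run_demo --dry-run \
    "$HOME/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar" \
    "$HOME/pp-monthly-update-new-version.csv"
```

Drop --dry-run to execute it for real, from ~/dotnet-spark-article/HousePrices-Core so that “dotnet run” finds the project.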

Hopefully, spark-submit outputs the average house prices over the last few years. If you want to see this work on a larger dataset then, instead of last month’s data, you can get the historical data by visiting https://data.gov.uk/dataset/4c9b7641-cf73-4fd9-869a-4bfeed6d440e/hm-land-registry-price-paid-data