Spark and dotnet in a single docker container

I really like the new dotnet driver for Spark because I think it makes Spark more accessible to devs who might not know Python or Scala.

If you want to build and run a dotnet application locally using the dotnet driver, you will need:

  • jre
  • Spark
  • dotnet sdk

Now the JRE has some docker images and so does the dotnet sdk, but what if you want both in a single container? The simplest way to get this setup is to use both the dotnet core image and the JRE image in a multi-stage dockerfile. For a working example see:

https://github.com/GoEddie/docker-dotnet-spark/blob/master/Dockerfile

The interesting bits are at the beginning:

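# Stage 1: the dotnet sdk image, named "core" so later stages can copy from it
FROM mcr.microsoft.com/dotnet/core/sdk:2.2 as core

# Stage 2: the JRE image, which becomes the final image
FROM openjdk:8u212-jre
# ADD unpacks a local tar archive into the target directory
ADD ./spark-2.4.3-bin-hadoop2.7.tgz /usr/local/

# Copy the dotnet sdk out of the first stage and link it onto the PATH at build time
COPY --from=core /usr/share/dotnet /usr/share/dotnet
RUN ln -s /usr/share/dotnet/dotnet /usr/bin/dotnet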

This syntax is fairly new in docker and you need at least docker 17.05 (client and daemon). After an image's “FROM blah” you can specify a name, “core” in this case, and later you can copy from the first image into the second using “--from=” on the “COPY” command.
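If you are not sure whether your docker is new enough, a version check along these lines might help. This is only a sketch: version_at_least is a helper name I made up, and it assumes your sort supports -V (GNU coreutils and BusyBox do).

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeeds when version HAVE >= version WANT.
# version_at_least is a hypothetical helper, not part of docker.
# Assumes `sort -V` (version sort) is available.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example use (needs a running docker daemon):
#   client=$(docker version --format '{{.Client.Version}}')
#   server=$(docker version --format '{{.Server.Version}}')
#   version_at_least "$client" 17.05 && version_at_least "$server" 17.05 \
#       || echo "multi-stage builds need docker 17.05 or newer" >&2
```

The commented-out lines show how you could feed it the client and daemon versions that docker itself reports.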

In this dockerfile I have added Spark 2.4.3 and the default environment variables we need to get Spark running. The ADD line expects spark-2.4.3-bin-hadoop2.7.tgz to sit next to the dockerfile. If you grab this dockerfile and run “docker build -t dotnet-spark .” you should get an image that includes the dependencies for dotnet as well as Spark.

Once you have built the image, you can start a container from it by running “docker run -it dotnet-spark bash”.

You can test that Spark works by running spark-shell, which should give you a nifty Spark shell. Quit that by typing :q and then test dotnet by running dotnet --info.
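If you would rather check everything in one go before starting the shells, something like this might do. It is a minimal sketch, and missing_tools is just a name I picked for the example.

```shell
# missing_tools NAME...: prints each command name that is not on the PATH.
# missing_tools is a hypothetical helper for this example.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example use inside the container:
#   missing=$(missing_tools java spark-shell spark-submit dotnet)
#   [ -z "$missing" ] || echo "not on PATH: $missing" >&2
```

An empty result means all of the tools were found. It does not prove they work, so running spark-shell and dotnet --info is still worth doing.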

Let's run an example project to see this all work together. Clone the project https://github.com/GoEddie/dotnet-spark-article in the running container:

cd ~
git clone https://github.com/GoEddie/dotnet-spark-article.git 

For this demo we will use the latest house price paid data from the UK, so download the latest data into the running docker container:

cd ~
wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-monthly-update-new-version.csv

Next we need to download the nuget packages that the demo solution uses, but we want to restore them to a known path because we need to reference the jar files inside the Microsoft.Spark package:

cd ~/dotnet-spark-article/HousePrices-Core
dotnet restore --packages ./packages

You should now have the Spark driver jar in: “~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars”

Let's make sure we can build our demo app:
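If you do not want to hard-code the package version in the path, you could look the jar up instead. This is a sketch only: find_spark_jar is a made-up helper, and it assumes the jar is named microsoft-spark-<spark line>-<package version>.jar, like the one used later in this post.

```shell
# find_spark_jar PACKAGES_DIR SPARK_LINE: prints the first matching driver jar
# under the restored packages directory, or fails if there is none.
# find_spark_jar is a hypothetical helper; the file name pattern is an
# assumption based on microsoft-spark-2.4.x-0.3.0.jar.
find_spark_jar() {
    jar=$(find "$1" -type f -name "microsoft-spark-$2-*.jar" | sort | head -n 1)
    [ -n "$jar" ] && printf '%s\n' "$jar"
}

# Example use from the project directory:
#   jar=$(find_spark_jar ./packages 2.4.x) || echo "driver jar not found" >&2
```

Passing the Spark line (2.4.x here) matters if the package ships jars for more than one Spark version.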

dotnet build

So what have we got?

We should have a version 8 JRE, the dotnet 2.2 sdk, a demo project, the dotnet Spark driver (from the Microsoft.Spark nuget package) and a csv file with a load of house prices from the UK.

Let’s submit our dotnet app to spark and get a result:

spark-submit --class org.apache.spark.deploy.DotnetRunner --master local ~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar  dotnet run  ~/pp-monthly-update-new-version.csv

What this does, argument by argument:

  • “spark-submit” starts Spark.
  • “--class org.apache.spark.deploy.DotnetRunner” tells Spark to start the dotnet driver.
  • “--master local” uses a local instance of Spark.
  • “~/dotnet-spark-article/HousePrices-Core/packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar” tells Spark where to find our driver.
  • “dotnet run” is what we want the driver to launch.
  • “~/pp-monthly-update-new-version.csv” is the path to the csv, which is passed to our app as the first argument.
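If you end up running this a lot, you could wrap the command in a small function. This is only a sketch: submit_dotnet_app and the SPARK_SUBMIT variable are names I invented here, and it needs to be run from the project directory because it calls “dotnet run”.

```shell
# submit_dotnet_app JAR CSV: runs `dotnet run CSV` under Spark's DotnetRunner.
# submit_dotnet_app and SPARK_SUBMIT are hypothetical names for this example.
# Set SPARK_SUBMIT="echo spark-submit" to print the command instead of running it.
submit_dotnet_app() {
    jar=$1
    csv=$2
    # Left unquoted on purpose so "echo spark-submit" splits into two words.
    ${SPARK_SUBMIT:-spark-submit} \
        --class org.apache.spark.deploy.DotnetRunner \
        --master local \
        "$jar" \
        dotnet run "$csv"
}

# Example use:
#   cd ~/dotnet-spark-article/HousePrices-Core
#   submit_dotnet_app ./packages/microsoft.spark/0.3.0/jars/microsoft-spark-2.4.x-0.3.0.jar \
#       ~/pp-monthly-update-new-version.csv
```

The dry run is handy for checking the argument order before Spark starts up.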

Hopefully, this outputs the average house prices over the last few years. If you want to see this work on a larger dataset then, instead of last month's data, you can get the historical data by visiting https://data.gov.uk/dataset/4c9b7641-cf73-4fd9-869a-4bfeed6d440e/hm-land-registry-price-paid-data
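If you want a rough cross-check of the numbers without Spark, an awk one-off might be enough. This is a sketch under assumptions I have not confirmed against the demo app: that every field in the csv is wrapped in double quotes, that column 2 is the price, and that column 3 is the transfer date starting with a four-digit year. average_price_by_year is a made-up name.

```shell
# average_price_by_year CSV: prints "YEAR AVERAGE" lines, sorted by year.
# average_price_by_year is a hypothetical helper for this example.
# Assumes column 2 is the price and column 3 is a date starting with the
# year, with every field wrapped in double quotes.
average_price_by_year() {
    awk -F',' '
        {
            price = $2
            date = $3
            gsub(/"/, "", price)      # strip the surrounding quotes
            gsub(/"/, "", date)
            year = substr(date, 1, 4)
            total[year] += price
            count[year]++
        }
        END {
            for (year in total)
                printf "%s %.2f\n", year, total[year] / count[year]
        }
    ' "$1" | sort
}

# Example use:
#   average_price_by_year ~/pp-monthly-update-new-version.csv
```

Splitting on commas is only safe here because the columns used come before any address fields, which may contain commas of their own. The demo app may group or filter differently, so treat any difference in the figures as a prompt to look closer rather than as an error.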
