Approaches to running Databricks ETL code from Azure ADF
Databricks is fantastic, but there is a small issue with how people use it. The problem is that Databricks is all things to all people. Data scientists and data analysts use Databricks to explore their data and write cool things. ML engineers use it to get their models to execute somewhere. Meanwhile, the cool kids (data engineers obviously) use it to run their ETL code. Some use cases favour instant access to “all the datas”, others favour rigorous engineering discipline, so when we look at Databricks it is very much a case of one size does not fit all.
In this post we are going to focus on how to run code for data pipelines - good old fashioned ETL, or ELT, or whatever; let’s just call them data pipelines so we don’t have to wonder whether we mean ELT or ETL or have a typo.
Typical Azure Setup
What typically happens is that ADF is used to call Databricks: ADF does something like bringing some data into ADLS or reading from a database, gets the data written to a really expensive dumb data store like “Microsoft SQL Server or Azure Synapse” (jokes obviously) and then calls Databricks to do the actual processing.
When we call Databricks from ADF we have a few options for the type of code we can call:
Notebook
Python File
JAR
Notebook
A notebook is a single file made up of sections called “cells”; each cell can be run on its own and can store state. None of this is available from ADF - with ADF you run an entire notebook (this is a key point, remember it for later).
Each cell in a notebook can be flipped from one language to another using what is known as a magic command (“%language”). For example, if you had a python notebook and in a cell wrote:
%sql
update delta_table set what_in_sweet_goodness = 'yes_sql_not_python';
It would execute the SQL command as SQL, not as python (insert mind blown meme, don’t actually because I can’t be bothered to find a link).
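For the curious, that magic command is roughly equivalent to handing the statement to spark.sql from Python - a minimal sketch, reusing the made-up table from above:
# roughly what the %sql cell ends up doing, written as python (a sketch, not the real notebook internals)
spark.sql("update delta_table set what_in_sweet_goodness = 'yes_sql_not_python'")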
With a notebook we also get some objects provided for us, think dbutils and spark. In Spark, if we want to do anything we first need to get a reference to the current SparkSession, so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('databricks-awesome').getOrCreate()
or even:
%scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
but with a notebook we don’t need to do that, we can just start using spark without instantiating it, nice.
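For example, in a brand new notebook cell you can just write something like this, with no imports and no builder (a quick sketch - the table name and mount point are made up):
# spark and dbutils are already there, nothing to import or build
display(spark.table("some_database.some_table"))
dbutils.fs.ls("/mnt/some-mount")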
Now, this is great for hacking on a server and seeing what data you have, because you don’t have to type spark = ... or remember to import dbutils if you want to use it, but it is terrible for writing code in an IDE or running tests on your desktop: the objects that are provided for you in a notebook don’t exist there unless you create them yourself, you won’t get any intellisense, and you will struggle to refactor anything - not pleasant for those of us who like to be productive.
So, notebooks are great for exploring data and looking at issues in Databricks, but lousy for developers who want to write well-engineered, testable code.
If you want to write unit tests, you can’t. The best you can do is write integration tests that typically go “deploy notebooks, run notebooks, check results”, and that takes a minimum of 2-3 minutes, probably longer - painful.
Python File
So this is my preferred approach (when using Python; when using Scala or .Net this choice would be obtuse, to say the least).
If you use a plain Python file you will need to instantiate your own SparkSession, which also means the code will run locally. Because it is plain Python, you can write unit tests and integration tests, and use intellisense, refactoring and everything else that every good developer likes to use.
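As a rough sketch of what that looks like (the module, column and path names are all made up), the trick is to keep the transformation logic in a plain function that takes and returns DataFrames, and only touch the SparkSession and storage at the edges:
# etl_job.py - a minimal sketch of a plain python file that ADF's "Python" activity could call
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

def transform(orders: DataFrame) -> DataFrame:
    # pure logic, no session or storage - this is the bit you unit test
    return (orders
            .where(F.col("amount") > 0)
            .withColumn("loaded_at", F.current_timestamp()))

def run(spark: SparkSession) -> None:
    # made-up paths; in real life these would come from arguments or config
    orders = spark.read.format("delta").load("/mnt/raw/orders")
    transform(orders).write.format("delta").mode("overwrite").save("/mnt/curated/orders")

if __name__ == "__main__":
    # on a Databricks cluster getOrCreate() hands back the existing session, locally it builds one
    run(SparkSession.builder.appName("etl-job").getOrCreate())
and the unit test on your desktop then needs nothing more than a local SparkSession:
# test_etl_job.py - pytest style sketch
from pyspark.sql import SparkSession
from etl_job import transform

def test_transform_drops_non_positive_amounts():
    spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    orders = spark.createDataFrame([(1, 10.0), (2, -5.0)], ["id", "amount"])
    assert transform(orders).count() == 1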
The thing you lose is being able to just jump onto a notebook, hack about and change the code to see how to make it work, but personally I don’t see that as a loss - adding engineering rigour to your development workflow is not a bad thing.
JAR
If you are writing Java or Scala then use a JAR file - same reasons as python.
.Net
No post I write about Spark is going to miss off .Net. If you are writing C# or F# then you can’t use a notebook anyway; you will also need to include the .Net driver JAR file (weird statement to make) and call that.
Hybrid solution
The last option is to mix things up a bit: use notebooks, but deploy your code as JARs or Python eggs/wheels and call the compiled (or packaged) code from the notebook. That way you get to unit and integration test the code but still have an easy way to run it on a Databricks cluster.
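As a sketch of the hybrid pattern (the package and function names are made up): the tested logic lives in a wheel that is installed on the cluster as a library, and the notebook that ADF calls shrinks to a thin wrapper:
# the entire notebook that ADF runs - the real, unit-tested logic lives in a wheel (my_etl is a made-up name)
from my_etl.jobs import orders_job

# pass in the notebook-provided spark session so the packaged code never has to build its own
orders_job.run(spark, source_path="/mnt/raw/orders", target_path="/mnt/curated/orders")
The wheel itself gets built, unit tested and versioned like any other Python package, which is the whole point.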
The hybrid approach could well be the best solution for everyone.
Last point
One last thing, because I guess it isn’t obvious: if you use Python files or JARs and want to run one outside of ADF, you either need a notebook to call it or you can create a job and execute that job. Small point, but hopefully useful.
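If you go down the job route, kicking one off from outside ADF is just a call to the Jobs API - a sketch with a made-up workspace URL, placeholder token and made-up job id:
# trigger an existing Databricks job via the Jobs "run-now" REST endpoint
import requests

workspace = "https://adb-1234567890123456.7.azuredatabricks.net"  # made-up workspace URL
response = requests.post(
    f"{workspace}/api/2.0/jobs/run-now",
    headers={"Authorization": "Bearer <personal-access-token>"},   # placeholder token
    json={"job_id": 42},                                           # made-up job id
)
response.raise_for_status()
print(response.json()["run_id"])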
Summary
If you want an effective development workflow for your Databricks ETL jobs that are called by ADF, then use Python files or JARs and ideally not notebooks.