ADF, XML objects and XPath in the expression language

When you use ADF, there are two sides to the coin. The first is the data itself that ADF does very well, from moving it from one site to another to flattening JSON documents and converting from CSV to Avro, to Parquet, to SQL is powerful. The other side of the coin is how ADF uses data as variables to manage the pipeline, and it is this side of the coin that I wish to talk about today.

The example we will walk through is where we have an XML document that we read using a web service call and we want to retrieve a specific value. Perhaps we need to use that value later on in the pipeline or perhaps we are just glutton for punishment. The XML document looks like:

<FHRSEstablishment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Header>
<ExtractDate>2021-02-12</ExtractDate>
<ItemCount>6</ItemCount>
<ReturnCode>Success</ReturnCode>
</Header>
<EstablishmentCollection>
<EstablishmentDetail>
<FHRSID>1073404</FHRSID>
<LocalAuthorityBusinessID>FP53</LocalAuthorityBusinessID>
<BusinessName>BaxterStorey Ltd</BusinessName>
<BusinessType>Restaurant/Cafe/Canteen</BusinessType>
<BusinessTypeID>1</BusinessTypeID>
<AddressLine1>Siemens Blade Factory</AddressLine1>
<AddressLine2>Alexandra Dock</AddressLine2>
<AddressLine3>Hull</AddressLine3>
<PostCode>HU9 1TA</PostCode>
<RatingValue>3</RatingValue>
<RatingKey>fhrs_3_en-GB</RatingKey>
<RatingDate>2019-03-14</RatingDate>
<LocalAuthorityCode>626</LocalAuthorityCode>
<LocalAuthorityName>Hull and Goole Port</LocalAuthorityName>
<LocalAuthorityWebSite>http://www.hullandgoolepha.gov.uk/</LocalAuthorityWebSite>
<LocalAuthorityEmailAddress>admin@hullandgoolepha.gov.uk</LocalAuthorityEmailAddress>
<Scores>
<Hygiene>5</Hygiene>
<Structural>10</Structural>
<ConfidenceInManagement>10</ConfidenceInManagement>
</Scores>
<SchemeType>FHRS</SchemeType>
<NewRatingPending>False</NewRatingPending>
<Geocode>
<Longitude>-0.29303899407387</Longitude>
<Latitude>53.74618911743160</Latitude>
</Geocode>
</EstablishmentDetail>
</EstablishmentCollection>
<FHRSEstablishment/>

The requirement:

  • Read the XML feed of food venue ratings in Hull: https://ratings.food.gov.uk/OpenDataFiles/FHRS626en-GB.xml
  • From the XML find the name of the element with the name FHRSID
  • From the FHRSID elements, find the one with the value “1073399”
  • From the correct FHRSID, find the sibling element called BusinessName
  • When you have the BusinessName element, return the text of that element

Sounds fun, right?

A sit down with grandpa

Ok, young person, I’m grandpa, and I’m telling you that doing processing like this in ADF is a pain in the butt. If you can do this processing in something like a web service, function, databricks app, whatever - then do that. Do weird painful in ADF at your peril and only if you cannot avoid it.

Back to it

So XML, isn’t it lovely! How do you query XML in ADF? Easy, ADF has two functions that will be useful here:

xml
xpath

The first function, xml, takes some text and creates an XML document from it - xml works nicely when we pass it some XML. It also takes JSON, which has some challenging requirements that we will have to deal with later (stay tuned, it is a real laugh).

The second function, xpath, does what you would expect, it runs an XPath query over the XML document.

XML Namespaces

aggggggggghhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh agggghhhhhhhhhhhhh agggghhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh and once more aagggggghhhhhhhhhhhh XML documents often contain XML namespaces which mean when you do something like run an XPath selector like //ElementName, you actually need to run something like //ns:ElementName - however xml in ADF has no way (that I can see to pass in a namespace) so we need to be a little bit crafty in how we use XPath to query the document and instead of using a nice //ElementName we need to use an XPath function to find the element we need and luckily, we have the full XPath library rather than an ADF re-implemented library. The first thing is finding all of the FHRSID elements:

//*[name()="FHRSID"]

next we need to find the one with the correct value:

[text()="1073399"]

we then need to find the sibling element called “BusinessName”

/../*[name()="BusinessName"]

Putting this all together in a single XPath expression is:

string(//*[name()="FHRSID"][text()="1073399"]/../*[name()="BusinessName"])

Now we have our XPath that finds our business name, namespaces be damned we need to pass the XPath expression to the XPath element:

@xpath(xml(activity('Get Site Map').output.Response),'string(//*[name()="FHRSID"][text()="1073399"]/../*[name()="BusinessName"])')

Get Site Map is defined:

{
                "name": "Get Site Map",
                "type": "WebActivity",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "url": "https://ratings.food.gov.uk/OpenDataFiles/FHRS626en-GB.xml",
                    "connectVia": {
                        "referenceName": "AutoResolveIntegrationRuntime",
                        "type": "IntegrationRuntimeReference"
                    },
                    "method": "GET"
                }
            },

When I run this, I get the output I require and can use it inside the ADF to either pass to another activity or do something else.

If you want to see this in action, hook up this git repo to a empty ADF and you’ll get the codezzzzzzzzz:

https://github.com/GoEddie/XmlParsingInADFExpressions

Enjoy!