Extending JSON Transformation Support in Azure Data Factory
One of the powerful features of Azure Data Factory is its extensive JSON support. Combined with the Copy Activity, many different types of JSON file can be mapped to a tabular sink.
However, there are limits. Consider the following example:
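Here is a representative payload of the kind in question (the values are illustrative; the field names are the ones the JSONata expression later in the post expects):

```json
{
  "Issues": [
    {
      "ID": 1,
      "Values": [
        { "Key": "Severity", "Value": "High" },
        { "Key": "Status", "Value": "Open" }
      ]
    }
  ]
}
```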
Here, the API has been designed for a backend that can accommodate different fields on issues by abstracting each field into a separate key/value object.
From a tabular perspective this schema maps most naturally to a highly normalized ‘key/value’ table. If we’re processing this JSON in Data Factory we really have two options:
- Map the JSON into a key/value table and pivot the data later
- Write custom logic to parse this JSON into a more natural tabular form
For the second option we can either
- use the built in Data Flow transformations, or
- break out to an external transform
Depending on the size of the JSON file, these options can be overkill, since both are largely focused on large-volume transformation using Spark or Databricks. Neither was optimal for our use case.
What we really want is an XSLT- or XQuery-like language for JSON that permits us to functionally express a simple transform of the JSON. After a little research on Stack Overflow, I stumbled on the JSONata package.
If you’ve not looked at it, I encourage you to. While the syntax is a little terse, it’s incredibly expressive. The example above can be transformed to this:
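Something along these lines (illustrative values, matching the sample payload above):

```json
[
  {
    "id": 1,
    "v": {
      "Severity": "High",
      "Status": "Open"
    }
  }
]
```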
With a one-line expression:

`Issues.[{"id":$.ID,"v":Values{Key:Value}}]`

Here `$.ID` refers to the current issue's ID, and `Values{Key:Value}` collapses the array of key/value objects into a single object.
Better still, there is an online environment (à la regex101) that permits rapid experimentation to find the transform that works.
The only remaining question is how to integrate JSONata into Data Factory. JSONata is a JavaScript library, so the natural solution is to host it in Node.js, packaged as an Azure Function. Here's an example in TypeScript:
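A minimal sketch of such a function, assuming the `jsonata` npm package (v2.x, where `evaluate` is asynchronous) and Blob Storage bindings named `inputBlob` and `outputBlob` declared in `function.json` (the binding names are assumptions):

```typescript
import { AzureFunction, Context, HttpRequest } from "@azure/functions";
import jsonata from "jsonata";

// Sketch only. Assumes a function.json declaring an HTTP trigger, a Blob
// Storage input binding named "inputBlob" whose path uses a {blobName}
// binding expression resolved from the POST payload, and a Blob Storage
// output binding named "outputBlob".
const httpTrigger: AzureFunction = async function (
  context: Context,
  req: HttpRequest
): Promise<void> {
  const transform: string | undefined = req.body?.transform;
  if (!transform) {
    context.res = { status: 400, body: "Missing 'transform' in payload" };
    return;
  }

  // The input binding delivers the contents of the named blob
  const input = JSON.parse(context.bindings.inputBlob);

  // Compile and evaluate the JSONata expression against the input document
  const result = await jsonata(transform).evaluate(input);

  // Assigning to the output binding writes the result back to Blob Storage
  context.bindings.outputBlob = JSON.stringify(result);
  context.res = { status: 200, body: "Transform applied" };
};

export default httpTrigger;
```

Because the transform arrives in the request payload, new transforms can be tried without redeploying the function.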
The function has an HTTP trigger and input/output bindings to Blob Storage. To execute a transform, pass the name of the blob and the expression to apply as a JSON POST payload:
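For example (the property names here match the sketch above; adapt them to whatever the bindings expect):

```json
{
  "blobName": "issues.json",
  "transform": "Issues.[{\"id\":$.ID,\"v\":Values{Key:Value}}]"
}
```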
The transformed file is written to the output-bound Blob Storage location, from where it can be picked up for onward processing within Data Factory.
I found this approach to be very usable and quick to iterate on. I hope you find it useful.