Extending JSON Transformation Support in Azure Data Factory


One of the powerful features of Azure Data Factory is its extensive JSON support. Combined with the Copy Activity, many different types of JSON file can be mapped to a tabular sink.

However, there are limits. Consider the following example:

{
  "StartAt":0,
  "Total":3,
  "Issues":[
    {
      "ID":1,
      "Values":[
        {
          "Key":"Title",
          "Value":"An issue"
        },
        {
          "Key":"Date",
          "Value":"2021-08-03"
        }
      ]
    },
    {
      "ID":2,
      "Values":[
        {
          "Key":"Title",
          "Value":"A second issue"
        },
        {
          "Key":"Date",
          "Value":"2021-08-13"
        }
      ]
    },
    {
      "ID":3,
      "Values":[
        {
          "Key":"Title",
          "Value":"A third issue"
        },
        {
          "Key":"Date",
          "Value":"2021-08-23"
        }
      ]
    }
  ]
}

Here, the API has been designed for a backend that can accommodate different fields on issues by abstracting each field into a separate key/value object.

From a tabular perspective, this schema maps most naturally to a highly normalized ‘key/value’ table. If we’re processing this JSON in Data Factory, we really have two options:

  1. Map the JSON into a key/value table (sketched below) and pivot the data later
  2. Write custom logic to parse this JSON into a more natural tabular form
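
For the first option, the key/value table produced from the example above would look something like this, which is why a later pivot step is needed before the data is usable:

ID  Key    Value
1   Title  An issue
1   Date   2021-08-03
2   Title  A second issue
2   Date   2021-08-13
3   Title  A third issue
3   Date   2021-08-23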

For the second option we can either use a Mapping Data Flow or run custom code in Azure Databricks.

Depending on the size of the JSON file, these options can be overkill, since they are largely focused on large-volume transformation using Spark or Databricks. Neither was optimal for our use case.

What we really want is an XSLT- or XQuery-like language for JSON that permits us to functionally express a simple transform of the JSON. After a little research on Stack Overflow, I stumbled on JSONata.

If you’ve not looked at it, I encourage you to. While the syntax is a little terse, it’s incredibly expressive. The example above can be transformed to this:

[
  [
    {
      "id": 1,
      "v": {
        "Title": "An issue",
        "Date": "2021-08-03"
      }
    }
  ],
  [
    {
      "id": 2,
      "v": {
        "Title": "A second issue",
        "Date": "2021-08-13"
      }
    }
  ],
  [
    {
      "id": 3,
      "v": {
        "Title": "A third issue",
        "Date": "2021-08-23"
      }
    }
  ]
]

With a one-line expression:

Issues.[{"id":$.ID,"v":Values{Key:Value}}]

Better still, there is an online environment (à la regex101) that permits rapid experimentation to find the transform that works.
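
The same expression can also be tested locally in Node with the jsonata npm package (the same library used in the function below). A minimal sketch, using only the example document from above:

import * as jsonata from "jsonata";

// The source document from the example above (trimmed to a single issue for brevity)
const source = {
  Issues: [
    {
      ID: 1,
      Values: [
        { Key: "Title", Value: "An issue" },
        { Key: "Date", Value: "2021-08-03" }
      ]
    }
  ]
};

async function run(): Promise<void> {
  // Compile the transform once; `Values{Key: Value}` collapses each issue's
  // key/value pairs into a single object keyed by Key.
  const expression = jsonata('Issues.[{"id": $.ID, "v": Values{Key: Value}}]');
  // Recent versions of jsonata return a Promise from evaluate(), so await works either way.
  const result = await expression.evaluate(source);
  console.log(JSON.stringify(result, null, 2));
}

run();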

The only remaining question is how to integrate JSONata into Data Factory. JSONata is a JavaScript library, so the natural solution is to host it in Node.js, packaged as an Azure Function. Here’s an example in TypeScript:

import { AzureFunction, Context, HttpRequest } from "@azure/functions"
import * as jsonata from "jsonata"

const convertJSON: AzureFunction = async function (context: Context, req: HttpRequest): Promise<void> {
    // The blob name and the JSONata expression to apply arrive in the JSON POST body
    const transformExpression: string = req.body && req.body.transform;
    const inputBlobName: string = req.body && req.body.blobName;
    let responseMessage = "";
    let responseCode = 200;
    try {
        if (req.method !== "POST") {
            throw new Error("Expect JSON passed on the POST method");
        }
        if (inputBlobName === undefined || inputBlobName.length < 1) {
            throw new Error("No input blob name found");
        }
        if (transformExpression === undefined || transformExpression.length < 1) {
            throw new Error("No transformation expression found");
        }
        // Compile the JSONata expression and evaluate it against the JSON document
        // supplied by the Blob Storage input binding
        const expression = jsonata(transformExpression);
        const result = expression.evaluate(context.bindings.inputBlob);
        // Assigning to the output binding writes the transformed JSON back to Blob Storage
        context.bindings.outputBlob = result;
        context.log('Processed a request.');
        responseMessage = "Wrote to output blob: " + inputBlobName;
    } catch (e) {
        responseMessage = e.toString();
        context.log('Failed a request: ' + responseMessage);
        responseCode = 500;
    }
    context.res = {
        status: responseCode,
        body: responseMessage
    };

};

export default convertJSON;

The function has an HTTP trigger plus Blob Storage input and output bindings. Pass the name of the blob and the transform expression as a JSON POST payload:

{ 
    "blobName": "Example.json",
    "transform":"Issues.[{'id':$.Id,'v':Values{Key:Value}}]"
} 

The transformed file is written to the output-bound Blob Storage location, from where it can be picked up for onward processing within Data Factory.
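
For completeness, here is a sketch of the kind of function.json this setup assumes. The container names, the connection setting, and the use of a {blobName} binding expression resolved from the request body are illustrative assumptions rather than the exact configuration:

{
  "bindings": [
    {
      "type": "httpTrigger",
      "direction": "in",
      "name": "req",
      "authLevel": "function",
      "methods": [ "post" ]
    },
    {
      "type": "blob",
      "direction": "in",
      "name": "inputBlob",
      "path": "input/{blobName}",
      "connection": "AzureWebJobsStorage"
    },
    {
      "type": "blob",
      "direction": "out",
      "name": "outputBlob",
      "path": "output/{blobName}",
      "connection": "AzureWebJobsStorage"
    },
    {
      "type": "http",
      "direction": "out",
      "name": "res"
    }
  ]
}

With bindings along these lines, posting the payload above would cause the runtime to read input/Example.json into context.bindings.inputBlob and write the transformed result to output/Example.json.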

I found this approach very usable and quick to iterate on. I hope you find it useful.