The greatest sin of tabular data
Consider a CSV file, an Excel worksheet, or a table in a relational database. They all have one or more columns of data. Each column has a name, “actualVolume” perhaps, and, in a relational database, a type: number, date, string, etc. (neither CSV nor Excel really enforces types, which is a whole world of different hurt).
In a relational database there may (or may not) be other constraints applied to the column: limiting the values it can hold, or relating the column to another table as a foreign key.
To the trained eye, this column means something: “ah yes, this is the volume of gas flowing through a pipeline, measured at a point in time, for a particular pipeline identified by another column”. To the untrained eye it means little; we must figure it out, or maybe ask someone.
To a machine, this means nothing. The data in the column has some characteristics that the machine can measure:
- minimum & maximum values
- mean and standard deviation
- a distribution (perhaps a normal distribution)
But none of these really tell the machine what the data is. That is left to the observer to interpret.
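A machine can compute all of those statistics from the raw values alone. Here is a minimal sketch with pandas; the column name echoes the example above and the numbers are invented for illustration:

```python
import pandas as pd

# Hypothetical pipeline readings standing in for the CSV, worksheet,
# or database table described above.
df = pd.DataFrame({"actualVolume": [812.4, 790.1, 805.7, 798.3, 820.9]})

col = df["actualVolume"]
print(col.min(), col.max())   # minimum & maximum values
print(col.mean(), col.std())  # mean and standard deviation
print(col.describe())         # a fuller statistical summary, quartiles included

# Everything above is purely statistical. Nothing tells the machine that these
# are gas volumes, what unit they are in, or which pipeline they belong to --
# that interpretation still lives in a human's head.
```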
This is true of most APIs as well. For example, the Platts Oil Inventory API has an endpoint USCrudeExports which, in the return payload, includes a field ‘Volume_MBD’. Helpfully, there is an English explanation:
Volume of forecast in thousand barrels per day
But there is nothing here to help a machine interpret that floating point number. No machine-readable description of the unit of measure, expected distribution or anything else.
Now, to be clear, I’m not knocking that API. It’s better documented and usable than many APIs.
So, why does this matter? It’s widely held that 80% of data science is data engineering/wrangling/munging. We spend too much of our valuable time making sense of the data, converting fields into comparable or joinable values, and so on. Can we do better?
The accepted answer is ‘implement a data catalog’. That will give us human-readable descriptions of the data and the name of a subject matter expert who can answer questions about it. That helps, but perhaps it only chips 10 points off the 80%.
I want to do more: imagine if, within the data catalog, you could describe the column not with a comment but by reference to a strong semantic type. You would connect “Volume_MBD” to something like this:
Type: Volume_MBD
- Is a unit of measurement
- Is a floating point number
- Has an English description: “Volume in thousand barrels per day”
- Has a French description: “Volume en mille barils par jour”
- Consists of a unit of measure #1: “Thousand Barrels”
  - Which = 1000 × 1 barrel
  - 1 barrel = 0.158987 m³ (SI unit)
- Consists of a unit of measure #2: “Day”
  - Which = 86,400 seconds (SI unit)
- Is calculated as #1 / #2
- Will never be < 0
- Will never be > 1,000,000,000
- Can be null
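To make that concrete, here is a rough sketch of what such a definition could look like in machine-readable form, using Python and rdflib to emit RDF. The namespace and property names are invented for illustration rather than taken from an existing vocabulary (ontologies such as QUDT already cover much of this ground):

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS, XSD

# Illustrative namespace -- not a published vocabulary.
EX = Namespace("http://example.org/semantic-types/")

g = Graph()
g.bind("ex", EX)

t = EX.Volume_MBD
g.add((t, RDF.type, EX.UnitOfMeasurement))
g.add((t, EX.baseDatatype, XSD.float))
g.add((t, RDFS.comment, Literal("Volume in thousand barrels per day", lang="en")))
g.add((t, RDFS.comment, Literal("Volume en mille barils par jour", lang="fr")))

# Unit of measure #1: thousand barrels, expressed via its SI equivalent.
g.add((EX.ThousandBarrels, EX.siConversionFactor, Literal(1000 * 0.158987)))  # m^3
# Unit of measure #2: one day, in SI seconds.
g.add((EX.Day, EX.siConversionFactor, Literal(86400)))                        # s

g.add((t, EX.numeratorUnit, EX.ThousandBarrels))
g.add((t, EX.denominatorUnit, EX.Day))
g.add((t, EX.minInclusive, Literal(0.0)))
g.add((t, EX.maxInclusive, Literal(1_000_000_000.0)))
g.add((t, EX.nullable, Literal(True)))

print(g.serialize(format="turtle"))
```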
Why would we do this? Fundamentally, to help solve the other 70% of data science. If we had machine-readable definitions of our data, we could write functions that read those definitions and generate ingest code for tools like Python, automatically parsing the data, validating it, and converting values to standard units.
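As a rough illustration, assuming the definition has been loaded from the catalog into a plain dictionary (the structure and field names below are hypothetical):

```python
import pandas as pd

# Hypothetical, simplified form of the catalog definition above.
VOLUME_MBD = {
    "dtype": "float64",
    "si_factor": (1000 * 0.158987) / 86400,  # thousand barrels/day -> m^3/s
    "min": 0.0,
    "max": 1_000_000_000.0,
    "nullable": True,
}

def ingest(series: pd.Series, definition: dict) -> pd.Series:
    """Parse, validate, and convert a column using its semantic-type definition."""
    s = pd.to_numeric(series, errors="raise").astype(definition["dtype"])
    if not definition["nullable"] and s.isna().any():
        raise ValueError("nulls found in a non-nullable column")
    valid = s.dropna()
    if ((valid < definition["min"]) | (valid > definition["max"])).any():
        raise ValueError("values outside the declared range")
    return s * definition["si_factor"]  # standardised to SI units

volumes_si = ingest(pd.Series([812.4, 790.1, None]), VOLUME_MBD)
```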
Further, if the column in question describes a relationship or referential data, such as country or region, these too can be normalized to a standardized dimension, ready to join with other datasets so described.
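For example, a free-text country column could be mapped onto a shared ISO 3166-1 dimension so two datasets join cleanly. The lookup table below is a tiny hard-coded stand-in for that standardized dimension:

```python
import pandas as pd

# Tiny illustrative slice of a standardized country dimension (ISO 3166-1 alpha-3).
COUNTRY_DIM = {"united states": "USA", "usa": "USA", "canada": "CAN", "mexico": "MEX"}

def normalise_country(series: pd.Series) -> pd.Series:
    """Map free-text country labels onto the shared dimension key."""
    return series.str.strip().str.lower().map(COUNTRY_DIM)

exports = pd.DataFrame({"country": ["USA ", "Canada"], "volume_mbd": [3.2, 1.1]})
prices = pd.DataFrame({"country_iso3": ["USA", "CAN"], "price": [78.4, 75.9]})

exports["country_iso3"] = normalise_country(exports["country"])
joined = exports.merge(prices, on="country_iso3")  # the datasets now join cleanly
```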
That could save data scientists a lot of time. But we can go further: these same machine-readable definitions can be used to automatically create internal- and external-facing APIs, accelerating product development.
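One hedged sketch of how that could work, using pydantic and FastAPI to generate an endpoint and its response model straight from the same definition, so the API contract and its OpenAPI documentation stay in sync with the catalog (the path and field names are invented, and unrelated to the real Platts API):

```python
from typing import Optional

from fastapi import FastAPI
from pydantic import Field, create_model

app = FastAPI()

# Reusing the hypothetical definition dictionary from the ingest sketch above.
VOLUME_MBD = {"min": 0.0, "max": 1_000_000_000.0, "nullable": True}

# Build the response model directly from the machine-readable definition.
field_type = Optional[float] if VOLUME_MBD["nullable"] else float
CrudeExportRow = create_model(
    "CrudeExportRow",
    volume_mbd=(field_type, Field(ge=VOLUME_MBD["min"], le=VOLUME_MBD["max"])),
)

@app.get("/us-crude-exports", response_model=list[CrudeExportRow])
def us_crude_exports():
    # Placeholder data; a real service would query the underlying dataset.
    return [{"volume_mbd": 812.4}]
```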
What’s crazy is that all the building blocks to do this exist today. RDF is the machine-readable language we can use to describe this data, and a catalog gives us a connection point. Yet few, if any, of the catalog vendors have really tackled this, and even recent data storage darlings like Snowflake ignore the problem. The industry has its collective head in the sand.
We can do better!