80% of data science
There’s a truism in Data Science that 80% of the effort is prepping data. In my experience, that’s frequently the case. It sucks: you’re trying to get to value and you’re spending way too much of your time wrangling data to get it where you need it. The same is largely true of data exchange between parties: the sender will do an export from their system, kick over a CSV, and then the receiver must figure out how to consume and import the data before delivering business value.
So, why is this? To me, assuming the data is of fundamentally sound quality (that’s a whole other post), it comes down to three issues:
Structure
What structure is the data in? JSON? XML? CSV? Relational? What character encoding? ASCII? UTF-16? Is all the data you need in a single file? Partitioned across multiple files? Perhaps there is some relational structure, with a fact file and a bunch of dimension files.
There are myriad options here, which may or may not be described by the source, and each impacts how to parse and ingest the data you need. Beyond raw structure, one must also figure out how big data elements can be. If one ‘column’ of strings is never longer than 50 characters, does that mean it can never be more than 50 characters, or have you just not seen a counter-example yet? Will all the numbers fit into a 32-bit integer, or do you need 64?
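To make that concrete, here’s a minimal sketch in Python (standard library only; the file name and the choice of statistics are my own assumptions for illustration) of the profiling a receiver often ends up writing just to guess at structural limits:

```python
import csv

# Profile a delimited file to guess column widths and whether values look numeric.
# Anything this reports is only true of the sample in hand: a maximum length of
# 50 today says nothing about what next month's export will contain.
def profile_csv(path: str) -> dict:
    stats: dict = {}
    with open(path, newline="", encoding="utf-8") as f:  # the encoding itself is a guess
        for row in csv.DictReader(f):
            for column, value in row.items():
                value = value or ""
                s = stats.setdefault(column, {"max_len": 0, "all_numeric": True})
                s["max_len"] = max(s["max_len"], len(value))
                if not value.lstrip("-").isdigit():
                    s["all_numeric"] = False
    return stats

if __name__ == "__main__":
    for column, s in profile_csv("export.csv").items():  # hypothetical file name
        print(column, s)
```

Everything a script like this infers is a property of the sample you were sent, not a guarantee from the sender.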
Syntax
Once you’ve started to read the data in, you need to deal with the syntax of how data is represented. Dates, for example, have numerous string representations despite the best efforts of standards bodies. Is it day or month first? Which time zone is the time in? Each of these must be discovered and handled.
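As a quick illustration (a Python sketch; the date string and formats are invented examples), the same text parses to two different dates depending on which convention the sender had in mind:

```python
from datetime import datetime

raw = "03/04/2025"

# The same string is a valid date under both conventions; nothing in the
# text itself tells you which one the sender intended.
as_day_first = datetime.strptime(raw, "%d/%m/%Y")    # 3 April 2025
as_month_first = datetime.strptime(raw, "%m/%d/%Y")  # 4 March 2025

print(as_day_first.date(), as_month_first.date())
```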
Semantics
Once you’ve figured out how all your data is represented and successfully loaded it, you have to deal with perhaps the hardest challenge: what does the data actually mean? Even seemingly simple columns can have complex semantics.
A column called ‘credit score’ might seem obvious (and therein lies plenty of opportunity for assumption) but immediately prompts questions:
- what period or date is the score for?
- is it an aggregate or from a specific agency?
Can’t we do better?
Of course we can. There are solutions for all of these problems:
- for structure and syntax, the sender could publish a JSON Schema or an XML schema. Even CSV has a standard (RFC 4180) that the sender could choose to abide by; there’s a sketch of what schema validation buys the receiver after this list
- for semantics, the obvious choice is the RDF suite of standards, particularly OWL and SHACL
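As a rough sketch of the first option (Python, using the third-party jsonschema package; the schema and the record are invented for illustration), a published schema turns silent assumptions about structure and syntax into rules a machine can check before any business logic runs:

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# A published schema answers the structural questions up front:
# which fields exist, their types, string lengths, and the date format.
schema = {
    "type": "object",
    "required": ["customer_id", "score", "scored_on"],
    "properties": {
        "customer_id": {"type": "string", "maxLength": 50},
        "score": {"type": "integer", "minimum": 0, "maximum": 1000},
        "scored_on": {"type": "string", "format": "date"},  # ISO 8601, not DD/MM vs MM/DD
    },
}

record = {"customer_id": "C-00042", "score": 712, "scored_on": "2025-04-03"}

try:
    validate(instance=record, schema=schema)
    print("record conforms to the published schema")
except ValidationError as err:
    print("rejected:", err.message)
```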
So why don’t we? Well, that’s a deep cultural and incentive problem that I’ll address in a future post.