Abstraction cost

Posted Dec 11, 2021

Back in September I posted about data normalization and promised a follow-up on abstraction cost: this is that post. In the context of data, abstraction is often a form of indirection, which immediately brings to mind the aphorism popularized by Butler Lampson (who attributes it to David Wheeler)

All problems in computer science can be solved by another level of indirection

and the corollary

…except for the problem of too many layers of indirection

An inherent feature of any form of normalization is that information in a source domain must be mapped to the destination ‘normalized’ domain. To take a simple example, a date might be expressed in day/month/year format and we standardize on year/month/day to simplify sorting.

In many cases normalization is lossless. No information is lost when we represent a date in a different format.
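As a minimal sketch of a lossless normalization, the Python below converts day/month/year strings to year/month/day and back again. The function names and sample dates are made up for illustration, and the inputs are assumed to be zero-padded.

```python
from datetime import datetime


def to_ymd(dmy: str) -> str:
    """Convert 'DD/MM/YYYY' to 'YYYY/MM/DD'."""
    return datetime.strptime(dmy, "%d/%m/%Y").strftime("%Y/%m/%d")


def to_dmy(ymd: str) -> str:
    """Convert 'YYYY/MM/DD' back to 'DD/MM/YYYY'."""
    return datetime.strptime(ymd, "%Y/%m/%d").strftime("%d/%m/%Y")


# Hypothetical source values, zero-padded day/month/year.
source = ["11/12/2021", "03/01/2022", "25/09/2021"]

# In year/month/day form a plain string sort is also a chronological sort.
normalized = sorted(to_ymd(d) for d in source)

# Mapping there and back recovers every source value, so nothing was lost.
# (An unpadded input such as '1/2/2021' would come back as '01/02/2021':
# the same date, but not the same string.)
round_tripped = [to_dmy(to_ymd(d)) for d in source]
```

Here normalized comes out in chronological order and round_tripped equals source, which is what lossless means in practice.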

There are also lossy normalizations. Imagine the date in question is a transaction timestamp. We have multiple sources, some of which capture the timestamp down to the millisecond, some to the second, some just to the day. If we wish to normalize this data we have a choice to make. Do we:

truncate all timestamps to the lowest common precision (the coarsest found in any source)
pad out truncated source timestamps with synthetic information to match the most detailed precision (for instance, if we only have day precision, pick the time halfway through the day)
store the source precision of the timestamp alongside the data

Each has pros and cons. With the first option we choose to lose data. With the second, we accept that we’re making things up and this may impact sorting. With the third we lose nothing but give up homogeneity and complicate recall.
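The three options can be sketched side by side in Python. Everything named here (Precision, truncate, pad, TaggedTimestamp) is hypothetical and exists only to make the trade-offs concrete; it is not a recommended design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class Precision(IntEnum):
    DAY = 0
    SECOND = 1
    MILLISECOND = 2


def truncate(ts: datetime, precision: Precision) -> datetime:
    """Option 1: discard everything finer than the given precision."""
    if precision is Precision.DAY:
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if precision is Precision.SECOND:
        return ts.replace(microsecond=0)
    # Millisecond precision: drop the sub-millisecond digits.
    return ts.replace(microsecond=ts.microsecond // 1000 * 1000)


def pad(ts: datetime, precision: Precision) -> datetime:
    """Option 2: invent the missing detail by picking the interval midpoint."""
    base = truncate(ts, precision)
    if precision is Precision.DAY:
        return base + timedelta(hours=12)
    if precision is Precision.SECOND:
        return base + timedelta(milliseconds=500)
    return base


@dataclass(frozen=True)
class TaggedTimestamp:
    """Option 3: keep the value and record how precise the source was."""
    value: datetime
    precision: Precision


# Two hypothetical records of transactions on the same day.
precise = datetime(2021, 12, 11, 9, 30, 15, 250000)  # millisecond source
day_only = datetime(2021, 12, 11)                    # day-precision source

truncated = truncate(precise, Precision.DAY)   # the 09:30:15.250 detail is gone
padded = pad(day_only, Precision.DAY)          # noon, a value no source reported
tagged = TaggedTimestamp(day_only, Precision.DAY)
```

Note that padded sorts after precise even though we have no idea which transaction actually happened first, and that any query over TaggedTimestamp values has to decide what to do with the precision field.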

Normalizing to an authority, that is, an agreed list of permitted values, is more complex still. We must ensure that the destination domain can semantically describe every possible value found in the source domains.

A simple example is country. If we have chosen to represent country in our destination domain using IOC country codes then, when we find Antarctica in a source domain, we have a problem. While ISO 3166-1 has a code for the southernmost continent, ATA, the IOC does not. Do we leave it blank? Change the authority for the destination domain?
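A rough Python sketch of the coverage problem follows. The mapping table is a tiny hand-picked subset for illustration, not a complete ISO-to-IOC crosswalk, and the helper names are invented.

```python
from typing import Iterable, List, Optional

# Illustrative subset only: ISO 3166-1 alpha-3 code -> IOC code.
# Antarctica (ATA) is absent because the IOC assigns it no code.
ISO_TO_IOC = {
    "DEU": "GER",  # Germany
    "NLD": "NED",  # Netherlands
    "CHE": "SUI",  # Switzerland
    "FRA": "FRA",  # France
}


def to_ioc(iso_alpha3: str) -> Optional[str]:
    """Return the IOC code, or None when this table has no mapping."""
    return ISO_TO_IOC.get(iso_alpha3.upper())


def unmappable(source_values: Iterable[str]) -> List[str]:
    """List the distinct source codes the destination cannot represent.

    With a subset table this also flags codes that are merely missing
    from the table, not just those the IOC genuinely lacks.
    """
    return sorted({v.upper() for v in source_values if to_ioc(v) is None})


source = ["DEU", "ATA", "CHE", "nld"]
normalized = [to_ioc(v) for v in source]
gaps = unmappable(source)
```

Running a check like unmappable over the source data before committing to a destination authority at least shows the size of the gap. What to do with each value in gaps is the question posed above.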

The answer depends on the business need but the problem is clear: in normalization we must sometimes trade precision for compatibility, abstracting away differences between source data for the benefit of homogeneity and recall.