Facts, Reference & Masterdata

B17 at the WWII museum in New Orleans

I connected with an ex-colleague this weekend who was gracious enough to share not only that he found the blog useful but also that he had used some of the articles to explain details to others. With that in mind, here’s something I wrote last year that might have wider applicability.

A common confusion in the data space is the difference between facts, reference data and master data. To some degree, this is our own fault since the terms are often used interchangeably. For this post I thought I would try and demonstrate the difference, with domain-specific examples that, hopefully, help to distinguish the three. To close, I’ll try and give a simple field guide for spotting them in the wild.

With that out of the way, let’s first tackle reference data. This is data that typically describes a lookup, valuable for defining some characteristic of another piece of data. Some common examples might be lists of:

If the data item is shown as a drop-down list or as radio buttons, it’s likely reference data. Reference data is typically a slowly changing dimension. New countries do occur from time to time, but that is infrequent. We might add a new value to a status field, but rarely would we remove one or add 100 new ones. Reference data can be a flat list, hierarchical (a taxonomy) or perhaps something more complex, such as an ontology.

The fundamental value of reference data is to permit, via standardization, comparison across domains of information. If we have two data sets, LNG liquefaction terminals and gas power stations, using the same reference data for location, we can easily compare by country or region. If they don’t share it, we place the burden on the consumer to make sense of it.
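As a minimal sketch of that idea, assume both data sets tag each record with the same country code and that a shared lookup maps codes to regions. All names and values below (`COUNTRY_REGION`, the terminals, the stations) are made up for illustration.

```python
from collections import Counter

# Hypothetical reference data: country code -> region.
COUNTRY_REGION = {
    "US": "North America",
    "QA": "Middle East",
    "AU": "Asia Pacific",
}

# Two illustrative data sets that both use the same country codes.
lng_terminals = [
    {"name": "Terminal A", "country": "US"},
    {"name": "Terminal B", "country": "QA"},
    {"name": "Terminal C", "country": "US"},
]
gas_power_stations = [
    {"name": "Station X", "country": "US"},
    {"name": "Station Y", "country": "AU"},
]


def count_by_region(records):
    """Count records per region using the shared reference lookup."""
    return Counter(COUNTRY_REGION[r["country"]] for r in records)


terminals_by_region = count_by_region(lng_terminals)
stations_by_region = count_by_region(gas_power_stations)

# Because both sides share one lookup, a side-by-side comparison is trivial:
# region -> (number of terminals, number of power stations).
comparison = {
    region: (terminals_by_region.get(region, 0), stations_by_region.get(region, 0))
    for region in sorted(set(COUNTRY_REGION.values()))
}
print(comparison)
```

If one data set used free-text country names instead, every consumer would have to write their own mapping before this comparison could work.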

What distinguishes master data from reference data? Fundamentally, it’s the mutability of the data: master data inherently changes over time. The list of LNG liquefaction terminals, along with their status, is constantly changing as new facilities are brought online and old ones are retired.

Like reference data, the value of master data is that if we all use the same set, we permit easier modelling and comparison. We also remove the potential for duplication of both data and effort. For this reason, customer and product lists are classically thought of as master data: if we want a single view of the customer, it’s important that every piece of data we have on them ties to the same record for the customer.

That mutability over time means that master data is typically more voluminous than reference data. This, in turn, makes clear identifiers important for uniquely referencing a particular entity on the master data list.
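To illustrate why a stable identifier matters, here is a small sketch assuming a master list keyed by a hypothetical `terminal_id`. The `Terminal` fields and the sample values are invented for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Terminal:
    terminal_id: str  # stable identifier, never reused or changed
    name: str
    country: str
    status: str


# Hypothetical master list of LNG terminals, keyed by identifier.
master = {
    "T-001": Terminal("T-001", "Terminal A", "US", "under construction"),
}


def update_terminal(master_list, terminal_id, **changes):
    """Apply attribute changes to one entity without changing its identity."""
    master_list[terminal_id] = replace(master_list[terminal_id], **changes)


# The facility comes online and is later renamed; the identifier stays put,
# so anything that referenced "T-001" still points at the same entity.
update_terminal(master, "T-001", status="operational")
update_terminal(master, "T-001", name="Terminal A (renamed)")
print(master["T-001"])
```

Keying on the name instead would have split this one facility into two records at the rename.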

Everything that is not reference or master data is facts. Facts typically change over time and are often numerical. The maximum output of a power station is a fact: it changes over time as the owner invests in and upgrades it. The real-time output is also a fact, albeit one that changes much more rapidly. If it’s a time series, it’s highly likely to be facts. Invoices, pay slips, utility bills: all are facts referencing master and reference data.
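The power station example can be sketched as dated observations that point back at a master record. This assumes a hypothetical `station_id` key and made-up capacity figures; it is one possible shape for such data, not a prescribed model.

```python
from datetime import date

# Hypothetical facts: maximum output observations for one power station.
# Each row references master data only through its identifier.
capacity_facts = [
    {"station_id": "S-001", "as_of": date(2020, 1, 1), "max_output_mw": 400.0},
    {"station_id": "S-001", "as_of": date(2023, 6, 1), "max_output_mw": 450.0},
]


def capacity_as_of(facts, station_id, when):
    """Return the most recent max output known on or before `when`, else None."""
    candidates = [
        f for f in facts
        if f["station_id"] == station_id and f["as_of"] <= when
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f["as_of"])["max_output_mw"]


print(capacity_as_of(capacity_facts, "S-001", date(2021, 1, 1)))
print(capacity_as_of(capacity_facts, "S-001", date(2024, 1, 1)))
```

The master record answers "which station is this?", while the facts answer "what was true of it, and when?".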

Where this can get confusing is that one person’s facts can be another’s master data. For example, in a CRM system you might have the current revenue of your customer. If you are in the business of tracking company revenues, revenue is a fact, something you track, manage, and sell. For someone else, it’s just a piece of master data displayed to help contextualize the account.

That point hints at how to distinguish between the three in the wild as well as the inherent overlap. If:

Another rule of thumb: the more it changes, and the more time we collectively spend on it, the more likely it is to be facts.

Hopefully that helps!