On Master Data

Posted Feb 6, 2022

Overlooking Phoenix, AZ from Piestewa Peak

Master data is the core data about the things your business does. If you’re a retailer, the lists of customers and products might be master data, while purchase history is transactional data linking the two. Of course, not all lists are master data. Reference data is the set of taxonomies that classify other data. For example, a list of product types is reference data, not master data. As ever, Wikipedia does a good job of explaining the differences.

From time to time I’ll encounter folks who believe that the goal is for master and reference data management to be centralized into a single system. It’s rare that you’re building the enterprise from scratch, and rarer still to be using one application to run the business. Hence, for all but the most trivial of enterprises, centralization can only be a goal or direction. But is it a smart one? I feel strongly that it is not. Putting all our master data eggs into a single basket creates a single point of failure and a choke point for change, forcing the enterprise to move at a single speed.

Further, most master data has some kind of workflow for creation. That workflow differs by type of business object: the workflow to onboard a new hire is very different from the workflow to onboard a new product. Do we want to converge all business processes as well?

None of this should be read as an endorsement of system proliferation. Rather, the aim is to find the sweet spot where master data is managed by the right number of systems, and that number is very likely greater than 1.

Given that master data is not centralized, we need to make it easy for a federation of systems to ingest and record transactions that involve master data that they don’t manage.

For example, our retailer’s point-of-sale (PoS) system might be the master for customers and products but not employees. Yet a sale needs to be recorded linking the three: “employee X sold product Y to customer Z”. The PoS system needs to be aware of the canonical employee master data even if it doesn’t directly manage it. When considering the enterprise architecture to support this reality, there are clear steps we can take to simplify these integrations.

Every master data record should have a unique identifier. That identifier must be immutable (i.e. it refers to the same thing over time) and should never collide (i.e. it never has the potential to reference multiple things).
Each system of record should make it embarrassingly easy for other systems to consume the master data it manages. That doesn’t mean it exposes everything. The HR system can expose the list of employees with quality identifiers without disclosing their salary. Every hindrance to consumption here creates friction and opportunity for proliferation.
Finally, there must be sufficient organizational rigor that this federated data model is embraced and understood. Proliferation of master data happens for many reasons, not least ignorance of other sources.
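
To make the first two points a little more concrete, here is a minimal Python sketch of the retailer example. Everything in it is hypothetical and chosen for illustration only: the class names (`HRSystem`, `PointOfSale`, `EmployeeView`), the use of random UUIDs as identifiers, and the in-memory storage all stand in for whatever your real systems do. The idea it tries to show is that the HR system owns the employee record, publishes only the identifier and name, and the PoS system records a sale against that identifier without ever managing (or seeing) the rest.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EmployeeRecord:
    """Full record held by the hypothetical HR system of record."""
    employee_id: str  # assigned once, never changed or reused
    name: str
    salary: int       # sensitive: stays inside the HR system


@dataclass(frozen=True)
class EmployeeView:
    """The subset of the record that HR publishes to other systems."""
    employee_id: str
    name: str


class HRSystem:
    """Hypothetical master for employees."""

    def __init__(self) -> None:
        self._employees: Dict[str, EmployeeRecord] = {}

    def onboard(self, name: str, salary: int) -> str:
        # Assumption: a random UUID is a good-enough identifier here.
        # It carries no meaning, so nothing about the employee can
        # force it to change, and collisions are vanishingly unlikely.
        employee_id = str(uuid.uuid4())
        self._employees[employee_id] = EmployeeRecord(employee_id, name, salary)
        return employee_id

    def published_employees(self) -> List[EmployeeView]:
        # Easy to consume, but salary is deliberately left out.
        return [EmployeeView(e.employee_id, e.name)
                for e in self._employees.values()]


@dataclass(frozen=True)
class Sale:
    """Transactional data: 'employee X sold product Y to customer Z'."""
    employee_id: str
    product_id: str
    customer_id: str


class PointOfSale:
    """Hypothetical PoS: consumes employee data, does not manage it."""

    def __init__(self, hr: HRSystem) -> None:
        self._hr = hr
        self.sales: List[Sale] = []

    def record_sale(self, employee_id: str, product_id: str,
                    customer_id: str) -> Sale:
        # Validate against the canonical list rather than a local copy.
        known_ids = {e.employee_id for e in self._hr.published_employees()}
        if employee_id not in known_ids:
            raise ValueError(f"unknown employee id: {employee_id}")
        sale = Sale(employee_id, product_id, customer_id)
        self.sales.append(sale)
        return sale
```

In a real enterprise the call to `published_employees()` would more likely be an API, a feed, or a replicated table, and the PoS would validate product and customer identifiers against its own masters. The shape is the same though: one owner per kind of master data, stable identifiers, and a deliberately narrow published view.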

Each of these points deserves a more thorough post. For now I’ll point you at this white paper on the first, which I think captures the recommendations well.