Last month I wrote about master data. As promised, here’s a follow-up on identifiers. My apologies, by the way, for the delay in posting: a combination of vacations and a job transition left me short of time in February.
At its most basic, an identifier is a string of characters that labels a unique instance of something within a wider class. The identifier is used across systems and processes to reference the underlying entity, which might be something physical - such as an airplane - or less tangible, perhaps a bank account. As ever, Wikipedia has a good primer along with some examples of frequently used identifiers. What makes a good one?
A key quality of identifiers is that they unambiguously reference something. For this to work, the identifier must be unique within the set of things we want to reference. For example, the identifier ‘Dan’ is unique within my family but would be a lousy one if used more widely.
From a data perspective, the best identifiers are globally unique, that is, a set of characters that we can be confident will not reference anything else. In computing, the best-known example is the UUID, which is generated in such a way that the chance of two ever matching is vanishingly small.
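As a minimal sketch of what that looks like in practice, the snippet below uses Python's standard `uuid` module to mint random (version 4) UUIDs. The helper name `new_identifier` is made up for illustration. Note that uniqueness here rests on probability (122 random bits) rather than on any central registry.

```python
import uuid


def new_identifier() -> str:
    """Return a fresh random (version 4) UUID in its 36-character text form.

    Hypothetical helper for illustration. A repeat is astronomically
    unlikely, though not strictly impossible.
    """
    return str(uuid.uuid4())


# Generate a batch and count how many distinct values we got.
ids = {new_identifier() for _ in range(10_000)}
print(len(ids))

# One more, to show the shape: 32 hex digits in five hyphenated groups.
print(new_identifier())
```

The price of that confidence is length: 36 characters is far more than anyone wants to read aloud or type.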
Somewhere between ‘Dan’ and a UUID is the VIN, which uniquely identifies a particular car but is not designed to reference anything else.
There is a natural tension here: to guarantee uniqueness we need to embrace longer identifiers, which are less comfortable and relatable for humans. The UUID sits at one end of this continuum, my name at the other.
Identifier schemes must balance these conflicting demands. The right answer depends largely on how often a human will be in the loop. A UPC is a lot to type accurately and quickly, but the bar code means people rarely have to.
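For the occasions when someone does have to key in a UPC, the scheme helps a little: the twelfth digit of a UPC-A code is a check digit computed from the first eleven. The sketch below shows that calculation in Python. The function names are invented for illustration, and it covers the 12-digit UPC-A form only.

```python
def upc_a_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit for the first 11 digits of a code."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])    # 1st, 3rd, ..., 11th digits
    even_sum = sum(digits[1::2])   # 2nd, 4th, ..., 10th digits
    total = 3 * odd_sum + even_sum
    return (10 - total % 10) % 10


def is_valid_upc_a(code: str) -> bool:
    """Return True if a 12-digit string has a matching check digit."""
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        return False
    return upc_a_check_digit(code[:11]) == int(code[11])


print(is_valid_upc_a("036000291452"))  # True
print(is_valid_upc_a("036000291453"))  # False: last digit mistyped
```

A check digit does not make the identifier any friendlier, but it does catch many of the slips a human makes when typing one.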
The best identifiers only ever refer to one thing. They are immutable over time. Putting aside concerns about the Ship of Theseus, in this sense a VIN is good: over all time, it only ever refers to one specific vehicle. By contrast, the license plate on the car is lousy: it can be reallocated, bought and sold. To uniquely identify a vehicle by license plate we must also include the date. With a VIN we need no other data.
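To make the difference concrete, here is a small Python sketch of a plate lookup. Everything in it is hypothetical: the `PlateAssignment` record, the registry contents, and the placeholder VIN strings (which are not real VINs). The point is only that the plate lookup needs a date argument, whereas a VIN would be the key on its own.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PlateAssignment:
    plate: str
    vin: str
    start: date             # first day the plate was on this vehicle
    end: Optional[date]     # last day, or None if still assigned


# Hypothetical registry: the same plate moves from one car to another.
ASSIGNMENTS = [
    PlateAssignment("ABC123", "VIN-OF-FIRST-CAR", date(2015, 1, 1), date(2019, 6, 30)),
    PlateAssignment("ABC123", "VIN-OF-SECOND-CAR", date(2019, 7, 1), None),
]


def vin_for_plate(plate: str, on: date) -> Optional[str]:
    """Resolve a plate to a vehicle; the date is required to disambiguate."""
    for a in ASSIGNMENTS:
        if a.plate == plate and a.start <= on and (a.end is None or on <= a.end):
            return a.vin
    return None


print(vin_for_plate("ABC123", date(2016, 3, 1)))  # VIN-OF-FIRST-CAR
print(vin_for_plate("ABC123", date(2022, 3, 1)))  # VIN-OF-SECOND-CAR
```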
Thought of this way, a webpage URL is lousy. While it uniquely identifies something on the internet, over time the thing it ‘points’ at can change markedly[1] - as any domain squatter will tell you.
Contextual and/or Dereferenceable
1957? If I give you just that number, I could be talking about any number of things: the year, an airplane (either a Boeing 737 or Airbus A320), a movie, a house number… Without context, 1957 means nothing. By contrast, 612-555-1957 is just another number, but the formatting gives us the context that it is a US phone number. Context can be provided in multiple ways:
- formatting convention (such as Social Security Numbers and dates, although consider the ambiguity of US vs EU date formatting)
- labelling (for example, the column header in a CSV lets us know that all values in that column belong to a certain identifier scheme)
- hierarchy (the first section of a Dewey Decimal number refers to the top-level class of the book)
- positional convention (the VIN has letters in some places and numbers in others; a URI such as a URL has a scheme followed by a colon character)
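The formatting and positional conventions are the ones a program can check directly. The sketch below, in Python, guesses an identifier scheme from shape alone. The pattern table and the `guess_scheme` name are assumptions for illustration, and shape only suggests a scheme: it cannot confirm that the phone number or SSN actually exists.

```python
import re

# Illustrative patterns only; real-world validation is stricter than this.
PATTERNS = [
    ("US phone number", re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")),
    ("US Social Security Number", re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")),
    # A URI starts with a scheme (a letter, then letters/digits/+/./-) and a colon.
    ("URI", re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:\S+")),
]


def guess_scheme(value: str) -> str:
    """Return the first scheme whose pattern matches the whole value."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(value):
            return label
    return "unknown"


for candidate in ["1957", "612-555-1957", "123-45-6789", "https://example.com/post"]:
    print(candidate, "->", guess_scheme(candidate))
```

The bare `1957` comes back as `unknown`, which is the point: without some convention wrapped around it, there is nothing to go on.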
URLs are good examples of dereferenceable identifiers. Not only do they give the consumer a label, they also give them a protocol by which they can retrieve more information about the identified thing. The URL of this page is both a unique identifier for this post and the instruction for how to get its contents.
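A short Python sketch shows both halves. The URL is a made-up placeholder rather than the address of this post, and `dereference` is a hypothetical helper that is defined but deliberately not called here, since calling it would make a network request.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

# Placeholder URL for illustration only.
POST_URL = "https://example.com/blog/identifiers"

# The identifier itself carries the retrieval instructions.
parts = urlsplit(POST_URL)
print(parts.scheme)   # https: which protocol to speak
print(parts.netloc)   # example.com: which host to ask
print(parts.path)     # /blog/identifiers: which resource to request


def dereference(url: str, timeout: float = 10.0) -> bytes:
    """Fetch whatever the URL currently points at (performs a network request)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Compare that with a VIN or a UUID: both identify something precisely, but neither tells you where to go to learn more about it.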
Designing a good identifier scheme is an exercise in trying to predict the future while balancing current needs. It’s not easy and everyone will have an opinion. Beyond the qualities above, the best advice is to accept that you cannot know everything and to build in extensibility from the start.
[1] This is an obvious flaw in NFTs. All that gets encoded on the blockchain is the link to the artwork, as Moxie Marlinspike demonstrated earlier this year.