Making a graph unique
This is the second in a series of posts that I originally wrote for an internal audience but that might be useful to others.
In my first post on the graph, I explored how the choice to only store three columns (subject, predicate & object), while potentially repetitive, permits us to store anything we can boil down to a set of facts.
However, this single three-column table has a challenge: how do we avoid clashes between rows of data? If we just use numerical identifiers, at some point the same number will end up identifying two different subjects. (In a traditional multi-table schema this problem is solved because an identifier is only used within the context of its table.)
The solution is deceptively simple and incredibly powerful. Instead of a number or string, use a Uniform Resource Identifier (URI). These are closely related to (and often look like) Uniform Resource Locators (URLs). Indeed, a URL is a URI but a URI doesn’t have to be a URL. The linked Wikipedia pages give far more detail on both, but to simplify, a URI creates uniqueness by adding hierarchy to the identifier.
To explain, let’s take a very artificial example: if my employee number is 5220, that’s not very unique. Plenty of things are numbered 5220. Let’s add some layers of uniqueness:
- “Employee/5220” is a bit better, but if multiple orgs use the same scheme we’ll still have clashes.
- “nonodename/Employee/5220” is even better, but there is no guarantee that another company won’t foolishly(!) use “nonodename” too. What we need is something we own, to place a stake in the ground.
- What many (but not all) URI schemes use is the domain name. If we make the identifier “nonodename.com/employee/5220” we’ve got quite close to unique.
- Finally, if we want to guarantee it, we can add the http prefix to form a URL, “http://nonodename.com/employee/5220”, and make sure that page resolves to something.
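The layering above can be sketched in a few lines of Python. This is only an illustration: `make_uri` is a hypothetical helper, not part of any standard, and the standard library's `urlsplit` is used just to show that the hierarchy can be read back out of the identifier.

```python
from urllib.parse import urlsplit


def make_uri(domain, *path_parts, scheme="http"):
    """Hypothetical helper: join a domain we own and a hierarchy of
    path segments into a single identifier."""
    return f"{scheme}://{domain}/" + "/".join(path_parts)


# Each layer narrows the context: domain, then kind of thing, then local number.
employee_uri = make_uri("nonodename.com", "employee", "5220")
print(employee_uri)  # http://nonodename.com/employee/5220

# The hierarchy is recoverable from the identifier itself.
parts = urlsplit(employee_uri)
print(parts.netloc)  # nonodename.com
print(parts.path)    # /employee/5220
```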
Re-writing the table of facts from the previous post using this scheme we might get:
Subject | Predicate | Object |
---|---|---|
http://nonodename.com/ld/well/UWI100143608517W600 | Depth | 1400’ |
http://nonodename.com/ld/well/UWI100143608517W600 | Latitude | 45.32423 |
http://nonodename.com/ld/well/UWI100143608517W600 | Longitude | -50.23143 |
http://nonodename.com/ld/pa/PCAAS000/2022/06/07/13:48:41 | Price | $125.795 |
Note the use of the date and time (2022/06/07/13:48:41) within the price assessment identifier to guarantee uniqueness. In using hierarchy like this we guarantee uniqueness but at a cost:
- The identifier becomes necessarily longer. That can burn storage. However, storage is cheap and much of each identifier is the same, which makes the data very amenable to efficient compression.
- The use of a domain name locks us in. If the company changes name, we must keep the old domain name around, potentially indefinitely, to prevent it being used by someone else.
That second point is why we chose the domain name ‘permid.org’ when we launched Thomson Reuters’ open data portal. That choice was vindicated a few short years later when Refinitiv was sold. The owning company changed; the domain name for the data stayed the same.
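Returning to the first cost, storage, a rough sketch can show why the shared prefix makes these identifiers compress well. The well numbers below are made up for illustration, and zlib is just a convenient general-purpose compressor from the Python standard library, not a claim about what any particular graph store uses.

```python
import zlib

# Assumption: 1,000 made-up well identifiers sharing one long prefix.
PREFIX = "http://nonodename.com/ld/well/"
uris = [f"{PREFIX}UWI{n:015d}" for n in range(1000)]

raw = "\n".join(uris).encode("utf-8")
compressed = zlib.compress(raw)

# The repeated prefix is exactly the kind of redundancy a compressor removes.
print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {len(compressed)}")
print(f"ratio:            {len(compressed) / len(raw):.2%}")
```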
This use of URIs is how the RDF standard – on which we build knowledge graphs – solves the problem of uniqueness. In doing so, it permits data from many places to be mixed, safe in the knowledge that, as long as everyone follows some simple rules, there will be no clashes. As an example, here are multiple providers identifying S&P Global:
- https://www.wikidata.org/wiki/Q868587
- https://www.google.com/search?kgmid=/m/02_nl4
- https://permid.org/1-4295904495
- https://dbpedia.org/page/S%26P_Global
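To make the "no clashes" point concrete, here is a minimal sketch that mixes facts from two sources. Everything in it is assumed for illustration: the second domain (example.org), the names, and the plain-tuple representation of a fact. Both sources happen to use the local number 5220, yet the full URIs keep them apart.

```python
# Each fact is a (subject, predicate, object) tuple; all values are hypothetical.
ours = {
    ("http://nonodename.com/employee/5220", "Name", "Alice"),
}
theirs = {
    ("http://example.org/employee/5220", "Name", "Bob"),
}

# With URIs as subjects, mixing the two sources is a plain set union.
merged = ours | theirs
by_subject = {}
for subject, predicate, obj in merged:
    by_subject.setdefault(subject, {})[predicate] = obj
print(len(by_subject))  # 2

# With bare numbers as subjects, the two employees collapse into one
# and the later source silently overwrites the earlier one.
by_bare_id = {}
for source in (ours, theirs):
    for subject, predicate, obj in source:
        local_id = subject.rsplit("/", 1)[-1]
        by_bare_id.setdefault(local_id, {})[predicate] = obj
print(by_bare_id)  # {'5220': {'Name': 'Bob'}}
```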
RDF takes this further, applying the same thinking to the predicate and object columns. That’s for the next post.