Making a graph unique

Volcanic lake on big island

This is the second in a series of posts that I initially wrote for an internal audience but that might be useful to others.

In my first post on the graph, I explored how the choice to only store three columns (subject, predicate & object), while potentially repetitive, permits us to store anything we can boil down to a set of facts.

However, this single three-column table has a challenge: how do we avoid clashes between rows of data? If we just use numerical identifiers, at some point two different subjects will end up with the same number as their identifier. (In a traditional multi-table schema this problem is solved because an identifier is only ever used within the context of its own table.)
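To make the clash concrete, here is a minimal sketch in Python. The two sources, the predicate names and the value "Alice" are invented for illustration; the point is only that bare numbers lose their context once everything lives in one table.

```python
# Two hypothetical sources that both use bare numbers as identifiers.
employee_facts = [(5220, "name", "Alice")]   # an employee record
well_facts = [(5220, "depth_ft", 1400)]      # an unrelated oil well

# Loading both into one three-column table removes the per-table context.
table = employee_facts + well_facts

# Asking for everything known about subject 5220 now mixes the two things.
facts_about_5220 = {
    predicate: obj for subject, predicate, obj in table if subject == 5220
}
print(facts_about_5220)
```

The query returns a name and a depth for what looks like a single subject, which is exactly the collision described above.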

The solution is deceptively simple and incredibly powerful. Instead of a number or string, use a Uniform Resource Identifier (URI). These are closely related to (and often look like) Uniform Resource Locators (URLs). Indeed, a URL is a URI but a URI doesn’t have to be a URL. The linked Wikipedia pages give far more detail on both, but to simplify, a URI creates uniqueness by adding hierarchy to the identifier.

To explain, let’s take a very artificial example: if my employee number is 5220, that’s hardly unique. Plenty of things are numbered 5220. Let’s add some layers of uniqueness:

Re-writing the table of facts from the previous post using this scheme we might get:

Subject                                                   Predicate  Object
http://nonodename.com/ld/well/UWI100143608517W600         Depth      1400’
http://nonodename.com/ld/well/UWI100143608517W600         Latitude   45.32423
http://nonodename.com/ld/well/UWI100143608517W600         Longitude  -50.23143
http://nonodename.com/ld/pa/PCAAS000/2022/06/07/13:48:41  Price      $125.795

Note the use of the date and time (2022/06/07/13:48:41) within the price assessment identifier to guarantee uniqueness. Using hierarchy like this guarantees uniqueness, but at a cost:

That second point is why we chose the domain name ‘permid.org’ when we launched Thomson Reuters’ open data portal. That choice was vindicated a few short years later when Refinitiv was sold. The owning company changed; the domain name for the data stayed the same.
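The hierarchical scheme used in the table above can be sketched in a few lines of Python. This is an illustration under assumptions, not how any production system mints identifiers: the helper `mint_uri` and the `employee` path segment are hypothetical, while the base domain and the `well` and `pa` segments are copied from the table.

```python
from urllib.parse import quote

# Base domain copied from the table above; keeping it stable keeps the URIs stable.
BASE = "http://nonodename.com/ld"


def mint_uri(kind: str, *parts: str) -> str:
    """Join the base, a type segment and identifier parts into one URI.

    Hypothetical helper: each part is percent-encoded (colons are kept)
    so that it stays a single path segment.
    """
    segments = [quote(part, safe=":") for part in (kind, *parts)]
    return "/".join([BASE, *segments])


well = mint_uri("well", "UWI100143608517W600")
price = mint_uri("pa", "PCAAS000", "2022", "06", "07", "13:48:41")

# The same local number under two different types no longer collides.
employee_5220 = mint_uri("employee", "5220")  # 'employee' is an assumed segment
well_5220 = mint_uri("well", "5220")

triples = {
    (well, "Depth", "1400'"),
    (price, "Price", "$125.795"),
    (employee_5220, "Name", "Alice"),  # invented value for illustration
}
print(well)
print(price)
print(employee_5220 != well_5220)
```

Because every layer of the hierarchy is part of the identifier, changing the base domain would change every URI minted from it, which is why the choice of domain matters so much.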

This use of URIs is how the RDF standard, on which we build knowledge graphs, solves the problem of uniqueness. In doing so, it permits data from many places to be mixed, safe in the knowledge that, as long as everyone follows some simple rules, there will be no clashes. As an example, here are multiple providers identifying S&P Global:

RDF takes this further, applying the same thinking to the predicate and object columns. That’s for the next post.