What's with a knowledge graph?

Screen grab of Google Trends graph for the term 'Knowledge Graph'

The phrase Knowledge Graph has been gradually increasing in use ever since Google popularized the term to describe their repository of structured knowledge back in May 2012. I still encounter a lot of confusion about what a Knowledge Graph is and why they are worth investing in.

In my view, Knowledge Graphs play a critical role in connecting data. To start, let’s consider what a graph is and why it differs from traditional forms of data representation and storage. In my mind, there are two fundamental differences, grounded in the idea that a graph should be able to store anything we want without pre-existing configuration.

For the rest of this post, let’s discuss the shape of data: Consider how we normally structure data, in tables, as the Google Trends data above:

Month “Knowledge Graph”: (United States)
2012-05 100
2012-06 14
2012-07 6
2012-08 10
2012-09 7
2012-10 9
2012-11 4
2012-12 10

The data has two columns, month and “Knowledge Graph”: (United States) which, looking at the trends we learn is the interest – out of 100 – for that month.

We can easily add other data points by adding additional columns (notice the values rebase to accommodate the far greater interest in Data Science as a search term)

Month “Knowledge Graph”: (United States) “Data Science”: (United States)
2012-04 <1 9
2012-05 13 8
2012-06 2 6
2012-07 1 5
2012-08 2 10
2012-09 1 16
2012-10 2 12

Representing data in tables such as these works well for human interpretation, but has some implications:

These limitations mean that it is not possible for table-based systems (the pre-eminent data storage model of today) to store arbitrary data without some degree of design ahead of storage to create a schema of collaborating tables. In a relational database this is strictly enforced (the data must fit the schema). In big data systems such as Hadoop the table is defined as data is stored and then relationships between them managed at query time.

If ever wider tables are not a workable answer, what is?

Well, what if the tables are only ever three columns, guaranteed? Can we model and store any arbitrary data in that form? It turns out we can, and that’s the underlying storage model for a knowledge graph: one fact per row. The three columns are:

Considering the month of May & June 2012, above, we might represent the search interest for the two topics in the following way:

Subject Predicate Object
Search_Interest_2012_05 KG_Count 13
Search_Interest_2012_05 DS_Count 8
Search_Interest_2012_06 KG_Count 2
Search_Interest_2012_06 DS_Count 6

A first glance this might seem like a very repetitive way of storing & representing data. That’s correct. But consider the benefits:

Subject Predicate Object
UWI 100 14 36 085 17 W6 00 Depth 1300’
UWI 100 14 36 085 17 W6 00 Latitude 44.32423
UWI 100 14 36 085 17 W6 00 Longitude -51.23143
PCAAS000_2022_06_07_13:48:41 Price $115.795

These three columns of subject, predicate and object are how a knowledge graph can store any data without pre-configuration. The three values are also the underpinning of the RDF standard for data, which deals with how to make any row globally unique. That will be the subject of the next post in this series.