More on Semantics & Databases

Sunrise over boats at Pikes Bay Marina

On the back of my post on the lack of semantics in regular databases, this week I joined Juan Sequeda and Tim Gasper on Catalog & Cocktails to talk about the concept in more detail. My thanks to them both for hosting me.

For this post, I thought I would explore how such a capability might work. In my mind, the fundamental idea is to build a link between the data dictionary in the database and RDFS/OWL definitions of semantic types.

On the podcast, Juan called it ‘See Also’. So let’s use that here. Imagine we have a table with five columns:

    (ID int, City int, HighTemp float, LowTemp float, Reading datetime)

We can already add comments to a column:

alter table readings 
  MODIFY COLUMN City int COMMENT "FK to Cities table",
  MODIFY COLUMN HighTemp float COMMENT "Highest temp in celsius recorded that day",
  MODIFY COLUMN LowTemp float COMMENT "Lowest temp in celsius recorded that day",
  MODIFY COLUMN Reading datetime COMMENT "Date of the temperature readings";

We can also retrieve comments from the data dictionary (see SQL Fiddle for an example in MySQL):


All long standing database behavior: importantly, accessible to client systems, but just as importantly, having no meaning to a machine.

Imagine, if we had a ‘seeAlso’ that permitted us to add the URI to an ontology on a column:

alter table readings 
  MODIFY COLUMN HighTemp float SEEALSO <>,

Just like comments, the seeAlso could be added incrementally to an existing schema and would be accessible through DDL queries.

The use of a URI means that any ontology can be referenced. From the database perspective, the URI doesn’t change anything, it’s just another field in the data dictionary.

Clients, however, can use the seeAlso to compare two columns: are they semantically the same? Further, if the URI is web accessible, they can retrieve/cache the definition (shown here in psuedo turtle but imagine as part of an ontology):

    isa temperature_c
    label, in English "High Temperature (c)"
    never greater than 60
    never less than -60

    never less than -273
    label, in English "A temperature, measured in Celsius"

The richer the definition, the more data quality can be directly measured rather than inferred after the fact. Further, the more unit conversions can be automated by client systems. Want everything in fahrenheit? If the software knows it’s in celsius it can convert for you.

What about projections? I suspect it depends on the operation on the column as to whether the seeAlso still makes sense. The resulting column would be either retained or removed based on the form of projection, with some clear rules needed to describe the behavior. Likely the simplest rule is to drop the seeAlso if the underlying type is changed by the projection.

For example, a mathematical operation would retain the seeAlso of the underlying column:

select HighTemp +4 from readings

Whereas a type change would remove the seeAlso:

select cast(HighTemp as CHAR) from readings

Clearly, a lot more thought required here. Key to this being really useful downstream would be standardized support in ODBC and JDBC (for instance adding getColumnSeeAlso to ResultSetMetaData in JDBC). Even without library support, though, the inherent nature of DDL means that a client could read the seeAlso definitions without the connection library knowing about it, albeit in a vendor specifc manner.

What do you think? Would this be useful to you? If so, please comment on Twitter (while it survives…) or LinkedIn.

If you’re a database vendor that would like to add this to your product, please feel free to embrace these ideas (I only require attribution) and/or get in touch. Wouldn’t it be great to unify the semantic and relational worlds?