Semantics revisited

La Plata Station, MO

Last year I wrote about my dissatisfaction with the current state of semantic layers. The TL;DR: if all you’re doing is tagging columns and tables with strings, you’re not really helping with semantics.

Last year also saw an explosion in RAG systems that, with a simple architecture and a decent LLM, make it possible to build chatbots that do a good job of interacting with our textual content.

It’s only natural that the industry should turn to tabular data, since that’s where a lot of business information is locked up and SQL remains difficult for end users. The big database engine players are all working on this: Snowflake with Cortex Analyst, Databricks with a solution they recently demonstrated over financial data, and Microsoft, who have been investing in synonyms to improve the Q&A feature of Power BI.

What each of these solutions requires is additional metadata to drive the feature, be it synonyms in Power BI, comments in Databricks or a YAML file for Snowflake.
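To make that concrete, here is a minimal sketch of the comment-based flavour, using ordinary SQL table and column comments of the kind these features read. The table, columns and wording are invented for illustration, and the exact comment syntax varies by engine:

```sql
-- Hypothetical table; the comments are the metadata the natural
-- language feature actually leans on, not the names themselves.
CREATE TABLE fct_orders (
  ord_dt   DATE,
  amt_usd  DECIMAL(18, 2),
  chnl     VARCHAR(20)
);

COMMENT ON TABLE fct_orders IS
  'One row per customer order, including cancelled orders.';

COMMENT ON COLUMN fct_orders.amt_usd IS
  'Gross order value in US dollars, before refunds and tax.';

COMMENT ON COLUMN fct_orders.ord_dt IS
  'Date the order was placed, in the customer''s local timezone.';
```

Note that all of this semantic freight is still free-text strings, parked in whatever slot the vendor happens to offer.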

This shouldn’t be surprising. Generating SQL from table and column names alone is bound to fail for anything but the most trivial schema: there is far too much ambiguity in what the names mean.
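A hypothetical example of the problem (all names invented): given only the DDL below, “total sales last month” has at least two defensible translations, and nothing in the schema says which is right.

```sql
-- Bare names carry almost no meaning.
CREATE TABLE txn (
  dt   DATE,            -- created? shipped? settled?
  amt  DECIMAL(18, 2),  -- gross or net? which currency?
  typ  VARCHAR(10)      -- 'S', 'R', 'X' ... who knows
);

-- Reading 1: every row counts as a sale.
SELECT SUM(amt)
FROM txn
WHERE dt >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
  AND dt <  DATE_TRUNC('month', CURRENT_DATE);

-- Reading 2: maybe 'S' means sale and refunds must be excluded?
SELECT SUM(amt)
FROM txn
WHERE typ = 'S'
  AND dt >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
  AND dt <  DATE_TRUNC('month', CURRENT_DATE);
```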

To recap: each vendor is building its own standard for semantics to power natural language query. That is a new layer of lock-in. While we’re gradually resolving the format wars of Iceberg and Delta Lake, we’re building a new set of incompatibilities on top.

Worse, each solution is based on strings, not things: the same interoperability issues I wrote about last year remain, and we are not solving for non-English speakers, except by adding yet another error-prone layer that translates the end user’s query into English before we figure out the SQL.

You know what?

The solution is staring us right in the face, but we engineers love to reinvent the wheel rather than research, learn and leverage something we don’t understand.

The next couple of years are going to be annoying. Where’s the fast-forward button on this?