The decline of the API for data sharing
In my previous post on the rise of SQL I explored the longevity of SQL and proposed that new architectures that separate compute and storage give SQL new viability for sharing data between business partners and with customers of data vendors.
For this (provocatively named) post I’d like to explore why I think SQL will supplant APIs as the predominant data sharing model.
It’s important to note that I’m focused only on the use of APIs for machine-to-machine information delivery. In other scenarios, for example where APIs are heavily used for mutating data, APIs will be long-lived: it’s a brave producer who gives writeable SQL access to a third party.
Traditionally, there have been two ways of sharing data:
- Structured (or semi-structured) delivery using formats like CSV over file transfer protocols such as SFTP
- API delivery
Broadly, cloud architectures give us two more:
- Delivery of formats like CSV over shared storage such as Amazon S3
- Cloud-enabled SQL, where the consumer pays compute for their query

Cloud-enabled SQL is an emerging technology. At the time of writing, Snowflake is furthest advanced, with Amazon, Google and Databricks as fast followers.
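To make that concrete, here’s a minimal sketch of the consumer experience on Snowflake, assuming the vendor’s share is already mounted in the consumer’s account. The account, warehouse, database and table names are hypothetical placeholders, not a real vendor’s share.

```python
# Minimal sketch: a consumer querying a vendor's dataset shared via
# Snowflake. All names below (account, warehouse, vendor_share) are
# hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # the consumer's own Snowflake account
    user="analyst",
    password="...",
    warehouse="ANALYTICS_WH",  # the consumer pays for this compute
)

# The vendor's share appears as an ordinary database: the data is
# queried in place, with no ingest pipeline or local copy required.
cur = conn.cursor()
cur.execute("""
    SELECT region, SUM(sales) AS total_sales
    FROM vendor_share.public.daily_sales
    WHERE sale_date >= '2023-01-01'
    GROUP BY region
""")
for region, total_sales in cur:
    print(region, total_sales)
```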
Where CSV wins is simplicity. Every ETL & BI developer knows how to ingest a CSV file from an FTP-like source, and almost every ETL & BI tool supports ingest of tabular CSV.
That simplicity brings challenges though: is this a full copy of data or just changes? What guarantees of type safety are there? Is there relational integrity across files? Does the SFTP server use some weird authentication that the ETL tool doesn’t support? How recent is the data? CSV and the like are easy to implement but ultimately require considerable work by the consumer to onboard and use.
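To make that onboarding burden concrete, here’s a minimal sketch of a consumer pulling such a feed; the host, credentials, file path and column names are hypothetical placeholders.

```python
# Minimal sketch of the consumer-side work a CSV-over-SFTP feed demands.
# Host, credentials, path and column names are hypothetical placeholders.
import io
import paramiko
import pandas as pd

transport = paramiko.Transport(("sftp.vendor.example", 22))
transport.connect(username="consumer", password="...")
sftp = paramiko.SFTPClient.from_transport(transport)

# Is this file a full snapshot or a delta? The protocol doesn't say;
# the consumer must learn the vendor's conventions out of band.
with sftp.open("/outbound/daily_sales.csv") as f:
    raw = f.read()

# No type safety: everything arrives as text, so the consumer must
# guess or hand-declare column types for each feed.
df = pd.read_csv(io.BytesIO(raw), dtype={"account_id": str})
df["sale_date"] = pd.to_datetime(df["sale_date"], errors="coerce")

sftp.close()
transport.close()
```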
By contrast, APIs can be rich. Standards such as Swagger, OData and GraphQL provide clear semantics for invocation and query. However, these come at considerable cost of engineering for producer and consumer. Much effort must be put into making the API perform, and the more complex the queries that are supported, the more it costs the vendor to support them. Ultimately, neither OData nor GraphQL exposes query capabilities as rich as the relational algebra of SQL.
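As an illustration, here’s a minimal sketch of consuming a hypothetical GraphQL endpoint; the endpoint, token and schema are placeholders. Note that the aggregation SQL would push to the server has to happen client-side, unless the producer builds it into their schema.

```python
# Minimal sketch: pulling rows from a hypothetical GraphQL data API.
# The endpoint, token and field names are placeholders, not a real service.
from collections import defaultdict

import requests

query = """
{
  dailySales(first: 100, filter: {saleDate: {gte: "2023-01-01"}}) {
    region
    sales
  }
}
"""

resp = requests.post(
    "https://api.vendor.example/graphql",
    json={"query": query},
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
rows = resp.json()["data"]["dailySales"]

# The GROUP BY that SQL would run server-side happens client-side here,
# unless the producer has built an aggregate type into their schema.
totals = defaultdict(float)
for row in rows:
    totals[row["region"]] += row["sales"]
```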
Beyond that, there is no single standard implementation for authentication in APIs. OAuth comes closest, but has a multitude of implementation options that can be exceedingly difficult to implement, especially from BI tools. Worse, these implementations are often user-, rather than machine-, centric. In the BI projects I’ve worked on, we’ve typically ended up warehousing data extracted from an API because it was too difficult and brittle to pull the data directly into the BI tool.
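For example, the closest thing to a machine-to-machine standard is OAuth’s client-credentials flow. Here’s a minimal sketch against a hypothetical vendor; even this leaves token endpoint shape, scopes and expiry behaviour varying vendor by vendor.

```python
# Minimal sketch: OAuth 2.0 client-credentials flow, the nearest thing
# to a machine-to-machine standard. URLs, client id/secret and scope
# are hypothetical placeholders.
import requests

token_resp = requests.post(
    "https://auth.vendor.example/oauth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "etl-job",
        "client_secret": "...",
        "scope": "read:sales",
    },
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# Every vendor varies the details: endpoint shape, scopes, token
# expiry, refresh behaviour. Each variation is onboarding work that
# a BI tool may not support natively.
data_resp = requests.get(
    "https://api.vendor.example/v1/sales",
    headers={"Authorization": f"Bearer {access_token}"},
)
data_resp.raise_for_status()
```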
Comparing the different techniques, we see that, from a consumer perspective, Cloud SQL has clear benefits for time to value:
| | CSV over SFTP | API | CSV over S3 | Cloud SQL |
| --- | --- | --- | --- | --- |
| Consumer must warehouse own copy | Yes | Likely | Depends on analytics framework | Only if they wish |
| Consumer pays for analytic compute | Yes | No | Yes | Yes |
| Type safety (e.g. string in an int column) | No | Probably | No | Yes |
| Supports arbitrary query | Yes | No | Yes | Yes |
| Exposes data structure/model | No | Depends on API | No | Yes |
| Exposes data relationships | No | Depends on API | No | Depends on vendor |
| Support for deltas/time travel | Custom | Depends on API | No | Custom in schema or via Delta Lake/Iceberg |
| Integration with ETL/BI tools | Usually | Custom connectors or code | Yes | Yes |
| Schema evolution | No | Artisanally managed | No | Via views, Delta Lake/Iceberg |
| Usage analytics for producer | None | If added | Basic | Better |
(Note that I am simplifying here. While CSV doesn’t provide type safety or a data model, Parquet does, at least for a single table.)
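For instance, here’s a minimal sketch of that single-table type safety, reading a hypothetical file with pyarrow:

```python
# Minimal sketch: unlike CSV, a Parquet file carries its own schema,
# so types are explicit per column. The file name is a hypothetical
# placeholder.
import pyarrow.parquet as pq

table = pq.read_table("daily_sales.parquet")

# Column names and types travel with the data, so a string can't
# silently land in an int column the way it can with CSV.
print(table.schema)
# e.g.
# region: string
# sales: double
# sale_date: date32[day]
```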
All those variable answers for APIs are the root of the problem. Where SQL semantics are clear and understood by all, every API is different and must be learnt by the consumer. But, there are drawbacks to SQL marketplaces:
Optically, cost of compute is the biggest downside of cloud SQL for consumers. In one sense, this argues for pricing data more cheaply via SQL marketplaces than via API. I wouldn’t: the additional value far exceeds any compute cost.
As a CIO, I got nervous about the costs I might be incurring with my team accessing SQL marketplaces. But, on deeper reflection, I got comfortable with it: this is no different than flat file delivery, where warehousing costs must be considered, or many APIs where we ended up making a copy of the data. The consumer is going to pay for compute somewhere, cloud SQL makes it more transparent.
Perhaps a bigger concern is that I may have to contract with a new vendor. If I’m not a Snowflake customer yet and the data I want is only on their marketplace I don’t have much choice but to become one, perhaps on less advantageous terms. This argues for data vendors being agnostic of any one marketplace and, wherever possible, meeting the customer where they are, regardless of cloud vendor.
Happily, competition isn’t standing still. Databricks announced their marketplace in mid-2022, Google brought Analytics Hub Data Exchange out of preview late in 2022, and Amazon supports Redshift access in their Data Exchange.
Ultimately, SQL marketplaces seem like a win/win for producers and consumers. Less work for the producer to build and support, less cost to run. Quicker and easier for consumers to adopt. Perhaps the API was just, like so many languages before, an intermediate stop on the road to SQL dominance?