Data Friction
Every year I complete a non-resident UK tax return. On paper. I mail it to the UK and someone somewhere types in the numbers I provided in order to calculate my tax. Much of the data I put on the return I calculated in a spreadsheet: electronic becomes paper becomes electronic.
If you live in the UK you can file online, but for unclear reasons that option has never been available to non-residents. You can pay to use third-party software to submit electronically, but why would you? You have no incentive. Succinctly, the data has ‘friction’: it is correct, but it is not structured in a way the receiving party can easily consume automatically.
Friction abounds in data, both within organizations and especially between them:
- a public company’s annual report, with financial statements embedded in a table within a PDF
- CSV or Excel files published as open data without a defined schema or any statement of when and how they are updated
- data published in HTML
In some cases, data is actively obfuscated to add friction: for example, social media companies want search engines like Google to index and rank them but don’t want others to see their data. This tortuous position leads to a constant fight between desired crawling and undesired scraping (the technical difference between the two being very fine indeed).
In most cases, though, the issue is far simpler: the creator of the data lacks the incentive, skills, time or knowledge to make the data easy for a machine to consume. It’s easy to drop a table into a Word document and save it as a PDF; it’s much more work to publish that table in a machine-readable format with an associated schema.
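As a purely illustrative sketch of what “machine readable with an associated schema” can mean, the fragment below publishes a small table as a CSV alongside a companion schema file describing each column. The field names, values and file names are assumptions for the example, loosely in the spirit of Frictionless Data’s Table Schema.

```python
import csv
import json

# Illustrative only: a "machine readable" publication is the data itself
# plus a companion schema describing each column.
rows = [
    {"period": "2023-Q4", "revenue_gbp": 125000, "employees": 14},
    {"period": "2024-Q1", "revenue_gbp": 131500, "employees": 15},
]

# Companion schema (field names and types are invented for the example).
schema = {
    "fields": [
        {"name": "period", "type": "string", "description": "Reporting quarter"},
        {"name": "revenue_gbp", "type": "integer", "description": "Revenue in GBP"},
        {"name": "employees", "type": "integer", "description": "Headcount at quarter end"},
    ],
    "primaryKey": "period",
}

# Write the data and the schema side by side.
with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[field["name"] for field in schema["fields"]])
    writer.writeheader()
    writer.writerows(rows)

with open("results.schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```

The point is less the particular format than that a consumer can discover the column names, types and keys without reading prose or reverse-engineering the file.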
How might you resolve this? Sadly, I don’t think there’s any near-term panacea, but there are clear approaches we can consider:
Provide traffic in return
Big sites such as Google, Bing, Twitter and LinkedIn all incentivize website owners to structure their data in order to improve search ranking and/or click-through. In doing so they are able to build a far deeper set of structured data from the web than pure scraping could achieve. That data is then available to anyone else willing to scrape it, or to use a source such as Common Crawl.
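The structure those sites reward is typically schema.org vocabulary embedded in pages, for example as JSON-LD. The sketch below, with made-up values, generates the kind of snippet a site owner might embed so a crawler can extract facts rather than parse prose.

```python
import json

# Illustrative schema.org markup (all values invented) that a site might
# embed so that crawlers can read structured facts instead of scraping text.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}

# The JSON-LD block is dropped into the page's HTML.
snippet = '<script type="application/ld+json">\n%s\n</script>' % json.dumps(product, indent=2)
print(snippet)
```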
Provide a service in return
If part of the challenge of publishing data is knowing the right way to structure it, then services like WikiData and Wikipedia (via DBpedia) provide storage, taxonomy, structure and more for anyone who wishes to contribute data.
Amazon and eBay achieve something similar for physical goods: mandating or encouraging provision of structured data to their service in return for listing in their marketplace.
Another, more distant, example is Open Table, which publishes restaurant recovery data. Restaurants contribute their data to get reservations, patrons contribute their reservations, and Open Table publishes aggregated booking data.
Regulate
Government is in a perfect position to insist that data is created in a machine-readable manner. A long-standing example is the SEC, which mandates that US company filings are submitted to the EDGAR service.
For business-to-government interchange, it’s reasonable to set standards and let commercial vendors provide solutions to help businesses comply. For consumer-to-government, that approach runs the risk of disenfranchising those who can’t afford middle-man solutions. Sure, government can mandate free versions, but the TurboTax saga shows the risks there. Better for government to invest in its own portals to capture structured data.
Provide payment in return
The most direct approach is monetization: you give me structured data and I give you money. While that works in most aspects of life, it hasn’t gained much traction for data outside of intermediaries like Refinitiv or Moody’s. Why is that? There are likely a couple of reasons:
- The difficulty of enforcing property rights. If I sell structured data to party ‘A’, it takes legal enforcement to prevent party ‘B’ from reusing the data it obtains from A. DRM doesn’t really exist for data.
- Much of the data is considered exhaust by the entity that creates it. Building the internal business case to monetize that exhaust can be difficult and is often viewed as a distraction.
The solution, in theory, is data marketplaces. These have long been proposed and tried; recent attempts include the Snowflake Data Marketplace and AWS Data Exchange. By providing compute alongside the data, the property-rights issues can be somewhat, if not completely, managed. It remains to be seen whether this latest generation will succeed.
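A toy sketch of the “compute alongside the data” idea follows. Everything here is hypothetical and is not the actual API of Snowflake or AWS: the provider keeps the raw rows, the consumer submits a query, and only an aggregate result crosses the boundary, which is what makes onward reuse of the underlying data harder.

```python
from statistics import mean

# Toy illustration of compute-to-data: the provider never ships raw rows;
# consumers submit a query and receive only aggregated results.
# All names and figures are hypothetical.
_PRIVATE_ROWS = [
    {"city": "Leeds", "covers": 42},
    {"city": "Leeds", "covers": 57},
    {"city": "York", "covers": 31},
]


def run_query(city: str) -> dict:
    """Provider-side entry point: aggregate in place, return only the summary."""
    covers = [row["covers"] for row in _PRIVATE_ROWS if row["city"] == city]
    return {
        "city": city,
        "bookings": len(covers),
        "avg_covers": mean(covers) if covers else 0,
    }


# The consumer sees the aggregate, never the underlying rows.
print(run_query("Leeds"))
```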