You don't need XML any more


I wrote a version of this post maybe six years ago. Revisiting it this morning, having read the state of the graph, I find it just as relevant now as it was then. So, quick question. Which of these two do you prefer?

{
     "name": "p",
     "type": 1,
     "contents": [
     {
          "name": "cdata",
          "type": 2,
          "contents": "This is "
     },
     {
          "name": "a",
          "type": 1,
          "contents": [
          {
               "type": 2,
               "contents": "mixed content"
          }
          ],
          "attr": {
               "href": "nonodename.com"
          }
     },
     {
          "type": 2,
          "contents": ". JSON can't easily do this"
     }
     ]
}
or
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<p>This is <a href='nonodename.com'>mixed content</a>. 
JSON can't easily do this</p>

JSON? You’re clearly some sort of masochist… but for many, XML is a legacy format that can and should be dumped.

The reality is that XML (and its vastly more ambitious precursor, SGML) has one huge advantage over most data transfer formats out there: mixed content models. Mixed content models allow the capture of structure within otherwise free text, and they are the basis on which HTML, and thus the modern web, is grounded.

The difference is between a markup language and object serialization. In the example above you see a hyperlink, denoted by the a element, within a paragraph, p. The mixed content model lets us declare which elements are allowed within an element (in this case p) without specifying the order in which they appear or how often. Text, called parsed character data or #PCDATA in the spec, is declared alongside those elements as just another permitted kind of content.
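To make that concrete, here is a minimal Python sketch of how one parser exposes mixed content, using the standard library's xml.etree.ElementTree on the paragraph from the example. ElementTree keeps the text before the first child in `text` and the text following each child in that child's `tail`. The `walk` helper is a hypothetical name introduced for illustration, and the XML declaration and line break are dropped to keep the sample on one line.

```python
import xml.etree.ElementTree as ET

# The paragraph from the example above, on one line and without the XML declaration.
DOC = "<p>This is <a href='nonodename.com'>mixed content</a>. JSON can't easily do this</p>"


def walk(element):
    """Yield (kind, value) pairs for an element's direct content, in document order."""
    if element.text:
        # Text that appears before the first child element.
        yield ("text", element.text)
    for child in element:
        yield ("element", child.tag)
        if child.tail:
            # Text that follows this child, up to the next sibling or the end tag.
            yield ("text", child.tail)


p = ET.fromstring(DOC)
for kind, value in walk(p):
    print(kind, repr(value))
```

The walk produces a text run, then the a element, then another text run: text and elements sit side by side as children of p, which is exactly what a mixed content model permits.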

Imagine an HTML page without hyperlinks, bold, italics, bullet lists… that's how important mixed content models are. Other formats like Avro or JSON can capture the same structure but, as you see above, so tortuously as to be unreadable and unwritable by a human, and complex to process in code.
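As a rough illustration of the processing cost, here is a Python sketch that turns the JSON tree from the example back into markup. It assumes the conventions of that example ("type" 1 for an element, 2 for text, children under "contents", attributes under "attr"), and the first text node carries a trailing space so the result matches the XML version. The `to_markup` function and the constant names are hypothetical, introduced only for this sketch.

```python
import json
from xml.sax.saxutils import escape, quoteattr

ELEMENT, TEXT = 1, 2  # the "type" codes used in the JSON example

TREE = json.loads("""
{
  "name": "p", "type": 1,
  "contents": [
    {"name": "cdata", "type": 2, "contents": "This is "},
    {"name": "a", "type": 1,
     "contents": [{"type": 2, "contents": "mixed content"}],
     "attr": {"href": "nonodename.com"}},
    {"type": 2, "contents": ". JSON can't easily do this"}
  ]
}
""")


def to_markup(node):
    """Recursively turn one JSON node back into an XML string."""
    if node["type"] == TEXT:
        # Text nodes hold a plain string; escape &, < and > for XML.
        return escape(node["contents"])
    # Element nodes: serialize optional attributes, then recurse into the children.
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.get("attr", {}).items())
    body = "".join(to_markup(child) for child in node["contents"])
    return f"<{node['name']}{attrs}>{body}</{node['name']}>"


print(to_markup(TREE))
```

Every consumer of the JSON form has to agree on these type codes and key names and write a recursive walker like this one, whereas the XML form is handled by any off-the-shelf parser.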

Machine-generated information rarely has this mixed form: it's generally object serialization, which XML can do but to which JSON is better suited. But for full text, XML and other languages that support mixed content models, like HTML and Markdown, will be around for an awfully long time.