Last year I posted about using Azure Data Factory (ADF) to ingest WikiData into the database. As my knowledge of Power Query (PQM) has continued to grow I realized it might be possible to accomplish the same results using PowerQuery rather than ADF.

On the face of it, this isn’t quite as useful since once data is in PowerBI it’s not really easy to get back out. While that doesn’t seem likely to change, Power Query is gradually gaining new execution environments: the support in ADF is something of a toy currently, but it’s easy to see it developing to something more powerful over the next few years. Microsoft, if you’re listening: a full PQM service that scales from single node to spark based and targets any destination accessible from ADF would be marvellous.

So, with that in mind and a request from a colleague for some help, here’s how to create a country dimension in Power BI using WikiData:

## Getting data out of WikiData

In the ADF flow we used the bulk export of WikiData. That’s not really practical for this use case. That leaves three other options:

1. Export by subject. Given an entity, for example, given https://www.wikidata.org/wiki/Q2321706 one can use the Special:EntityData form to get the underlying RDF, using the Accept header to control the format. In this case, https://www.wikidata.org/wiki/Special:EntityData/Q2321706 This doesn’t always seem totally performant, presumably because WikiData does some conversions on demand.
2. Export using Linked Data Fragments. If you can express what you want in terms of absolute matches on subject, predicate and object (or any combination of the same) the LDF endpoint is fast and easy to use and very REST friendly.
3. Finally, we can use SPARQL. The WikiData Query Service is super friendly as well as RESTy but non trivial queries can end up timing out.

Just like ADF, Power Query doesn’t natively understand RDF, but it happily slurps up CSV, so the trick, as last year, is to represent everything in ntriples and massage from there.

## Wrapping in PQM

Given these sources, here’s some sample PQM functions that expose them.

GetTriplesForSubject takes an entity URI and calls the Special:EntityData page to download all triples for that subject:

let
GetTriplesForSubject = (subject as text) as nullable table =>
let
EntityID = Text.End(subject,Text.Length(subject)-Text.PositionOf(subject,"/",Occurrence.Last)),
Options = [Headers = [#"Accept"="application/n-triples"]],
WikiURI = Text.Combine({"https://www.wikidata.org/wiki/Special:EntityData",EntityID}),
// Convert to CSV, parsing for ' ' as the separator
data = Csv.Document(Web.Contents(WikiURI,Options),{"Subject", "Predicate", "Object", "Terminator"}," "),
noTerminator = Table.RemoveColumns(data,"Terminator"),
cleaned = Table.TransformColumns(noTerminator,
{{"Subject", each CleanNTriple(_), type text},
{"Predicate", each CleanNTriple(_), type text},
{"Object", each CleanNTriple(_), type text}})
in
cleaned
in
GetTriplesForSubject

The function relies on a helper, CleanNTriple, that we’ll cover further down.

GetLabel takes a subject URI and gets the English label for it using the SPARQL service. In theory, one could extract this out of the full set of triples for the subject but this function is way faster. The SPARQL service doesn’t return ntriples but will happily generate CSV.

let
GetLabel = (subject as text) as nullable table =>
let
// Request Ntriples from the LDF Service
Options = [Headers = [#"Accept"="text/csv"]],
FixedSubject = if Text.At(subject,0) <> "<" then Text.Combine({"<",subject,">"}) else subject,
Query = Text.Combine({"SELECT ?label WHERE {
SERVICE wikibase:label {
bd:serviceParam wikibase:language ""en"" .
",FixedSubject," <http://www.w3.org/2000/01/rdf-schema#label> ?label
}
}"}),
url = Text.Combine({"https://query.wikidata.org/sparql?",Uri.BuildQueryString([query=Query])}),
data = Csv.Document(Web.Contents(url,Options)),
promoted = Table.PromoteHeaders(data, [PromoteAllScalars=true])
in
promoted
in
GetLabel 

This function can easily be generalized to make an arbitrary SPARQL call.

The final function is GetAllTriplesMatching. This invokes the LDF service, handling pagination as necessary to get all fragments matching the pattern. The pagination technique is lifted largely from Microsoft’s TripPin tutorial.

let
GetAllTriplesMatching = (optional sub as text,optional pred as text, optional obj as text) as table =>
GeneratePage((previous) =>
let
QueryParams = [],
QPWithSubject = if sub <> null then Record.AddField(QueryParams,"subject",sub) else QueryParams,
QPWithObject = if pred <> null then Record.AddField(QPWithSubject,"predicate",pred) else QPWithSubject,
AllQueryParams = if obj <> null then Record.AddField(QPWithObject,"object",obj) else QPWithObject,
// if previous is null, then this is our first page of data
url = Text.Combine({"https://query.wikidata.org/bigdata/ldf?",Uri.BuildQueryString(AllQueryParams)}),
nextLink = if (previous = null) then url else Value.Metadata(previous)[NextLink]?,
// if NextLink was set to null by the previous call, we know we have no more data
page = if (nextLink <> null) then GetPage(nextLink) else null
in
page
)
in
GetAllTriplesMatching

The following support functions are used

Takes the n-triple result from the LDF service and looks for the row with predicate http://www.w3.org/ns/hydra/core#nextPage, returning the object that is the URL of the next page of triples.

let
GetNextLink = (response) as nullable text =>
let
Matching = Table.SelectRows(response,each [Predicate] = "<http://www.w3.org/ns/hydra/core#nextPage>"),
Link = if Table.IsEmpty(Matching) then
null
else
Record.FieldOrDefault(Table.SingleRow(Matching), "Object"),
LinkCleaned = if Link <> null then
else null
in
in
GetNextLink

### GetPage

Given a URL to the LDF service retrieves the results in n-triples and converts them to a table, returning the next link as metadata.

let
GetPage = (url as text) as nullable table =>
let
// Request Ntriples from the LDF Service
Options = [Headers = [#"Accept"="application/n-triples"]],
// Convert to CSV, parsing for ' ' as the separator
data = Csv.Document(Web.Contents(url,Options),{"Subject", "Predicate", "Object", "Terminator"}," "),
noTerminator = Table.RemoveColumns(data,"Terminator"),
cleaned = Table.TransformColumns(noTerminator, {{"Subject", each CleanNTriple(_), type text},
{"Predicate", each CleanNTriple(_), type text},
{"Object", each CleanNTriple(_), type text}}),
// this cleaning breaks the hydra next page links so look back
in
in
GetPage

### GenerateByPage

Repeatedly calls the LDF service through pagination to retrieve all triples matching the pattern.

let
GenerateByPage = (getNextPage as function) as table =>
let
listOfPages = List.Generate(
() => getNextPage(null),
(lastPage) => lastPage <> null,
(lastPage) => getNextPage(lastPage)
),
tableOfPages = Table.FromList(listOfPages,Splitter.SplitByNothing(),{"Column1"}),
firstRow = tableOfPages{0}?
in
if (firstRow = null) then
Table.FromRows({})
else
Value.ReplaceType(
Table.ExpandTableColumn(tableOfPages, "Column1", Table.ColumnNames(firstRow[Column1])),
Value.Type(firstRow[Column1])
)
in
GenerateByPage

### CleanNTriple

Removes the < and > characters from an individual value if present.

let
CleanNTriple = (triple) as nullable text =>
let
tripleCleaned = if Text.At(triple,0) = "<" and Text.At(triple,Text.Length(triple)-1) = ">" then
Text.Middle(triple,1,Text.Length(triple)-2)
else
triple,
cleanedString = ParseNTripleEscape(tripleCleaned)
in
cleanedString
in
CleanNTriple

### ParseNTripleEscape

Unescapes Java style character encodings by pushing through the JSON parser. Would be nice if there was a more efficient way of doing this but I haven’t found one.

let
ParseNTripleEscape = (escape) as nullable text =>
let
resolved = Record.Field(Json.Document(Text.Combine({"{""e"": """,escape,"""}"})),"e")
in
resolved
in
ParseNTripleEscape

## Putting it all together

Given these functions, this script creates a country dimension from WikiData with ISO two digit code and English language label for each country.

let
pred = "http://www.wikidata.org/prop/direct/P297",
data = GetAllTriplesMatching(null,pred,null),
onlyMatches = Table.SelectRows(data,each [Predicate] = pred ),
dropPredicate = Table.RemoveColumns(onlyMatches,{"Predicate"}),
renamed = Table.RenameColumns(dropPredicate,{{"Object", "ISOCode"}}),
Labled = Table.AddColumn(renamed, "GetLabel", each GetLabel([Subject])),
expanded = Table.ExpandTableColumn(Labled, "GetLabel", {"label"}, {"GetLabel.label"}),
renamedLabel = Table.RenameColumns(expanded,{{"GetLabel.label", "Label"}, {"Subject", "CountryURI"}})
in
renamedLabel

One huge advantage over the ADF variant: it’s much faster - maybe a minute for this to run.

In theory these functions could be packaged as a Power BI Custom Connector. That’s not high on my list of to-dos but if you’re interested, tweet me @nonodename and, with enough enthusiasm I’ll figure it out!