Notes

Querying Wikidata with SPARQL

April 21, 2020

A few days ago I wanted to find a dataset with a list of winners of the Palme d’Or. The table on Wikipedia wasn’t formatted in a way that would be easy to copy, and I figured this was as good a time as any to figure out how Wikidata works.

Wikidata is a Wikimedia Foundation project that aggregates structured information about the world. It stores and retrieves facts based on an item/property/value system. Items are nouns, like Parasite. Properties define a relation, such as “award received” or “director”; they also specify another item as the value associated with that property, like Palme d’Or or Bong Joon-ho.

Wikidata provides a public query interface that uses a query language called SPARQL. No relation with Spark: it’s actually running on a graph database called Blazegraph. The query language works on a sort of fill-in-the-blanks basis with the item/property/value triples.

For example, to see which awards Parasite won, you can write a query like this:

SELECT ?award ?awardLabel
WHERE {
    #  Q61448040 is Parasite 
    #  P166 is "award won" 
    wd:Q61448040 wdt:P166 ?award
    SERVICE wikibase:label {bd:serviceParam wikibase:language "en"}
}

Here’s that query in the query editor. The core of this is the wd:Q61448040 wdt:P166 ?award in the WHERE clause: you’re basically saying “get all X where Q61448040 has the relationship P166 with X”. Statements can be composed or nested within queries. Here’s a list of sandwich ingredients generated by querying for instances of sandwiches, then returning the items that are listed as each sandwich’s ingredients. Cool.

(The SERVICE wikibase:label stuff is just there to provide the English names associated with the entities returned: it automatically populates the ?awardLabel in the SELECT statement.)

Properties themselves are also associated with their own values. For example, Parasite won the Palme d’Or in 2019; the “award won” relationship described earlier encodes the year using its own “point in time” statement, associated with a timestamp. These sub-properties, called qualifiers, can also be queried.

I decided I wanted to retrieve the title, award year, and director names for all the Palme d’Or winners. To retrieve that data, the final query’s inner SELECT statement is composed of the following statements. First, it retrieves items that are instances of the “Film” concept or a subclass thereof:

# P31: instance of; P279: subclass of; Q11424: Film
?film wdt:P31/wdt:P279* wd:Q11424;

Such films are then filtered to only return instances that have the “award received” property with a value of “Palme d’Or”. The pq:P585 ?date bit returns the object of the “point in time” property of the “award received” property as ?date:

# P166: award received; Q179808: Palme d'Or; P585: point in time
p:P166 [ ps:P166 wd:Q179808; pq:P585 ?date ]. 

Finally, any items that the film is associated with through the “Director” property are also returned:

OPTIONAL {
  # P57: director
  ?film wdt:P57 ?director
}

This is then aggregated with a regular SQL-like GROUP BY and cleaned up with string aggregation and date formatting functions. I also had to configure the labels (film title and director name) manually, since the auto-detection doesn’t know how to parse more complex expressions in SELECT statements.

The final query is available below and as a gist. You can also run it on Wikidata directly.

SELECT
  ?film
  ?title
  ?directors
  ?year
WHERE
{
  SELECT
    ?film
    ?title
    (GROUP_CONCAT(?directorLabel; SEPARATOR=" & ") AS ?directors)
    (STR(YEAR(?date)) AS ?year)
  WHERE
  {
    # P31: instance of; P279: subclass of; Q11424: Film
    ?film wdt:P31/wdt:P279* wd:Q11424; 
    # P166: award received; Q179808: Palme d'Or; P585: point in time
    p:P166 [ ps:P166 wd:Q179808; pq:P585 ?date ]. 
    OPTIONAL {
      # P57: director
      ?film wdt:P57 ?director
    }
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en" .
      ?director rdfs:label ?directorLabel .
      ?film rdfs:label ?title
    }
  }
  GROUP BY
    ?film
    ?title
    ?date
}
ORDER BY ?year