UMBC CMSC 491/691 Fall 2022
Knowledge Graphs



Homework 3: Querying Wikidata with SPARQL

out 2022-11-15, due 2022-11-30

get your repo link

This homework will give you some experience getting data from Wikidata using SPARQL.

The standard public SPARQL endpoint for DBpedia is http://dbpedia.org/sparql. Visiting this in your browser takes you to a page where you can enter and execute a SPARQL query via a public system offered by OpenLink Software. Their triple store technology is very good, comes in both commercial and open source versions, and supports most (but not all) of the SPARQL 1.1 standard (see here for some examples).

Getting started

Clone the HW3 Git repository. The directory has a stub for each of the queries you have to write, with a name like q??.txt. Edit each stub to be a working query, verifying that it works using the Wikidata Query Service, Yasgui (set its endpoint to https://query.wikidata.org/sparql), or the Python script sparql.py, which runs the query and produces output files. The script has been updated to make the Wikidata SPARQL endpoint the default.
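To make the idea of "running a query against the endpoint" concrete, here is a minimal sketch of how a client could send a query to the Wikidata endpoint using only the Python standard library. It is not the course's sparql.py, whose internals are not shown here. The function names and the User-Agent string are hypothetical, and the sketch assumes the endpoint accepts the query as a `query` URL parameter and returns JSON when asked for `application/sparql-results+json`.

```python
import json
import urllib.parse
import urllib.request

# Endpoint named in this assignment's sample output.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, endpoint=WIKIDATA_ENDPOINT):
    """Build a GET request asking the endpoint for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical identifier; replace it with one describing your client.
        "User-Agent": "hw3-example-client/0.1",
    }
    return urllib.request.Request(url, headers=headers)


def run_query(query, endpoint=WIKIDATA_ENDPOINT):
    """Send the query over the network and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        return json.load(response)


def run_query_file(path, endpoint=WIKIDATA_ENDPOINT):
    """Read a query from a text file (e.g. a stub you edited) and run it."""
    with open(path, encoding="utf-8") as handle:
        return run_query(handle.read(), endpoint)
```

Calling `run_query_file("qex1.txt")` would need network access; `build_request` alone can be inspected offline to see exactly what would be sent.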

Example Wikidata queries

Some sample queries are shown below. For each there is a link to the query text, the results of running it as JSON and as a simple table, and a link to the query in the Yasgui SPARQL interface. If you encounter a problem using Yasgui, make sure that the URL for the SPARQL server starts with HTTPS and not HTTP.

Query qex1 (example json, result) (open in WDQS)

We've done this one as an example; the query is the contents of the file qex1.txt.

# Who are Trump's current and past spouses?
SELECT ?spouse ?spouseLabel ?rank
WHERE {
  wd:Q22686 p:P26 [ps:P26 ?spouse; wikibase:rank ?rank] .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Note that this query is more complicated than the similar example in HW2, since the spouse properties may be ranked, with only the current spouse having 'preferred' rank and former spouses having 'normal' rank.
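As an illustration of what the ?rank column lets you do, the sketch below separates preferred-rank rows from the rest in a result set that follows the standard SPARQL JSON results layout. The sample rows are made up for the example (placeholder labels, not real query output), and the function name is hypothetical. It assumes the rank comes back as a URI in the wikibase: namespace (http://wikiba.se/ontology#).

```python
# Full IRI of wikibase:PreferredRank.
PREFERRED_RANK = "http://wikiba.se/ontology#PreferredRank"


def split_by_rank(bindings):
    """Return (preferred, others): spouse labels grouped by statement rank."""
    preferred, others = [], []
    for row in bindings:
        label = row["spouseLabel"]["value"]
        if row["rank"]["value"] == PREFERRED_RANK:
            preferred.append(label)
        else:
            others.append(label)
    return preferred, others


# Made-up rows shaped like the "results" -> "bindings" list of a response.
sample_bindings = [
    {"spouseLabel": {"type": "literal", "value": "Spouse A"},
     "rank": {"type": "uri", "value": "http://wikiba.se/ontology#NormalRank"}},
    {"spouseLabel": {"type": "literal", "value": "Spouse B"},
     "rank": {"type": "uri", "value": "http://wikiba.se/ontology#PreferredRank"}},
]

current, former = split_by_rank(sample_bindings)
print("preferred rank:", current)
print("other ranks:", former)
```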

All of the prefixes you will need (e.g., wd, wdt, p, ps, wikibase) are predefined in the WDQS, so you should not need to add them to your queries.

Query qex2 (example json, result) (open in WDQS)

# Who are Donald Trump's children and what schools did they attend?
SELECT ?child ?childLabel ?school ?schoolLabel
WHERE {
  wd:Q22686 wdt:P40 ?child.
  ?child wdt:P69 ?school.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

Query qex3 (example json, result) (open in WDQS)

# Show US presidents and their children who attended the same school
SELECT DISTINCT ?presidentLabel ?childLabel ?schoolLabel
WHERE {
  ?president wdt:P39 wd:Q11696;
             wdt:P40 ?child;
             wdt:P69 ?school.
  ?child wdt:P69 ?school.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?presidentLabel

Wikidata query problems to solve

For each of the following seven queries, write a SPARQL query in the corresponding file in your repo (q1.txt to q7.txt) that produces a reasonable answer. We've provided answers that we thought were reasonable as both a JSON object and a simple HTML table. Your answers might differ a bit, but try to be close. You might, in fact, get a better answer. You can generate both the JSON and table results by calling the sparql.py script with an argument that is the name of your query file, e.g., q1.txt.

Query q1 (example json, result)

Find rivers in Maryland

This one is easy, but you will have to determine the IRI to use for Maryland, the appropriate type (or types!) to recognize that something is a river or a stream and the right property or properties that link a river to a state. A good approach is to sample a few rivers (e.g., Mississippi, Patapsco, Linganore Creek) and see how their data is represented.

Query q2 (example json, result)

Find rivers in Maryland and (optionally) their lengths, ordered from longest to shortest

This is a simple modification of q1 that requires the use of SPARQL's OPTIONAL feature so that we return rivers even if the length is not available.
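When you inspect the JSON output for a query that uses OPTIONAL, note that in the SPARQL JSON results format a row whose optional pattern did not match simply has no entry for that variable. The sketch below shows one way to read such rows in Python and order them longest first, with rivers of unknown length at the end. The variable names riverLabel and length and the sample rows are hypothetical; your query may use different names.

```python
def rivers_by_length(bindings):
    """Return (name, length-or-None) pairs, longest first, unknowns last."""
    rows = []
    for row in bindings:
        name = row["riverLabel"]["value"]
        # An unmatched OPTIONAL leaves the variable out of the row entirely.
        length = float(row["length"]["value"]) if "length" in row else None
        rows.append((name, length))
    rows.sort(key=lambda pair: (pair[1] is None, -(pair[1] or 0.0)))
    return rows


# Made-up rows for illustration only.
sample_bindings = [
    {"riverLabel": {"type": "literal", "value": "River A"},
     "length": {"type": "literal", "value": "10"}},
    {"riverLabel": {"type": "literal", "value": "River B"}},
    {"riverLabel": {"type": "literal", "value": "River C"},
     "length": {"type": "literal", "value": "50"}},
]

for name, length in rivers_by_length(sample_bindings):
    print(name, length)
```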

Query q3 (example json, result)

Find the number of rivers in each US state, ordered by state

This query will require using GROUP BY and COUNT.
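If GROUP BY and COUNT are new to you, it may help to see the same computation on a small in-memory table. The following Python sketch is only an analogy, not a solution: it groups made-up (state, river) pairs by state, counts distinct rivers per group, and orders the result by state. All names and data in it are hypothetical.

```python
from collections import Counter


def count_rivers_per_state(pairs):
    """Group (state, river) pairs by state and count distinct rivers."""
    # set() drops duplicate pairs, much as counting DISTINCT values would.
    counts = Counter(state for state, _river in set(pairs))
    # Sorting the (state, count) items orders them by state name.
    return sorted(counts.items())


# Made-up pairs; "river1" is listed twice for "State X" on purpose.
sample_pairs = [
    ("State X", "river1"),
    ("State Y", "river2"),
    ("State X", "river3"),
    ("State X", "river1"),
]

for state, total in count_rivers_per_state(sample_pairs):
    print(state, total)
```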

Query q4 (example json, result)

Find rivers, streams or creeks in Maryland whose waters eventually end up in the Chesapeake Bay

Water from a stream might empty into a small river that empties into a larger river that empties into the Bay. You should identify all of the rivers or streams whose water eventually ends up in the Bay.
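The "eventually ends up in" relation is a transitive one: you follow the "empties into" link one or more times. The Python sketch below illustrates that idea on a toy graph; it is an analogy for the logic only, with made-up names, and is not a SPARQL solution. In SPARQL 1.1, this kind of repeated link-following is what property paths are for.

```python
def drains_into(target, empties_into):
    """Return every watercourse whose water eventually reaches target.

    empties_into maps each watercourse to the body of water it flows into.
    """
    # Invert the edges so we can walk upstream from the target.
    upstream = {}
    for source, mouth in empties_into.items():
        upstream.setdefault(mouth, []).append(source)

    found, stack = set(), [target]
    while stack:
        node = stack.pop()
        for source in upstream.get(node, []):
            if source not in found:  # also guards against cycles in the data
                found.add(source)
                stack.append(source)
    return found


# Toy data for illustration only.
toy_graph = {
    "stream": "small river",
    "small river": "large river",
    "large river": "bay",
    "other river": "ocean",
}

print(sorted(drains_into("bay", toy_graph)))
```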

Query q5 (example json, result)

Find subclasses of the Wikidata Malware class and for each (including Malware), find the number of direct instances and all instances (direct and those of all of its subclasses). Sort the results by the number of all instances.
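To clarify the difference between direct instances and all instances, here is a small Python sketch that computes both counts over a toy class hierarchy. The class and item names are made up, the function names are hypothetical, and the descending sort order is an assumption, since the task statement does not give a direction. It shows the intended arithmetic, not how to write the query.

```python
def descendants(cls, children):
    """Return cls together with all of its direct and indirect subclasses."""
    seen, stack = {cls}, [cls]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def instance_counts(root, subclass_of, instance_of):
    """Return (class, direct count, all count) rows for root and its subclasses."""
    children = {}
    for child, parent in subclass_of:
        children.setdefault(parent, []).append(child)

    rows = []
    for cls in descendants(root, children):
        below = descendants(cls, children)
        direct = {item for item, c in instance_of if c == cls}
        everything = {item for item, c in instance_of if c in below}
        rows.append((cls, len(direct), len(everything)))
    # Assumed ordering: most instances first, ties broken by class name.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Toy hierarchy: C is a subclass of B, which is a subclass of A.
toy_subclass_of = [("B", "A"), ("C", "B")]
toy_instance_of = [("x", "A"), ("y", "B"), ("z", "C"), ("w", "C")]

for row in instance_counts("A", toy_subclass_of, toy_instance_of):
    print(row)
```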

Query q6 (example json, result)

Find instances of programming languages that have at least one sitelink. For each, show its ID, name, and number of sitelinks. Order by the number of sitelinks, descending.

A sitelink is a link from an item to an article on a Wikimedia site about the item, such as a page in a Wikipedia site. The number of sitelinks an item has is one way to estimate its prominence and is useful for many purposes. You can find an item's sitelinks using the schema:about property in Wikidata.

Query q7 (example json, result)

Find instances of software and the programming language(s) or framework(s) they were implemented in. For each programming language or framework that was used to implement at least one software item, show its ID, name, and how many software instances it was used in. Order by that number, descending.

Writing your queries

Write each of the queries in a separate file, with names from q1.txt to q7.txt. You can debug your queries using the web-based Wikidata Query Service or a client like Yasgui. You should eventually use the simple Python script sparql.py, which takes one or more query files, runs them through the endpoint, and writes out the responses into files as both JSON and HTML. You can use a command line statement like

	  python sparql.py q?.txt

to run all of your queries and create the .json and .html files. You can then compare your output to that produced by our model solutions by looking at the JSON and HTML files. Our model solutions produced the following output.

Query q1.txt on https://query.wikidata.org/sparql returned 85 results
Query q2.txt on https://query.wikidata.org/sparql returned 85 results
Query q3.txt on https://query.wikidata.org/sparql returned 50 results
Query q4.txt on https://query.wikidata.org/sparql returned 73 results
Query q5.txt on https://query.wikidata.org/sparql returned 125 results
Query q6.txt on https://query.wikidata.org/sparql returned 1560 results
Query q7.txt on https://query.wikidata.org/sparql returned 328 results
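If you want to check your own result counts against the numbers above, a short script can read the JSON output files and count the rows. This is a minimal sketch that assumes the files hold standard SPARQL JSON results (a "results" object containing a "bindings" list) and are named like q1.txt.json; the function names are hypothetical.

```python
import json


def count_results(json_text):
    """Return the number of result rows in a SPARQL JSON results document."""
    return len(json.loads(json_text)["results"]["bindings"])


def report(paths):
    """Print the number of results found in each JSON output file."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            print(f"{path}: {count_results(handle.read())} results")


# A tiny made-up document with two rows, for illustration only.
sample = '{"head": {"vars": ["x"]}, "results": {"bindings": [{"x": {"type": "literal", "value": "1"}}, {}]}}'
print(count_results(sample))
```

For example, `report(["q1.txt.json", "q2.txt.json"])` would print one line per file once those files exist.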

What and how to submit

Get the files from your HW3 Git repository by cloning it either on your own computer (if you have Python installed on it) or on your gl account. Stubs for the queries are in files q?.txt. Edit each of these to be a working query. You can debug your query using one of the public Web clients (see above) or by using the sparql.py program. The sample output includes the answers our model solution provides. Your answers might vary a bit, but hopefully not too much.

When you are done, you can rerun all of your queries with a command like python sparql.py q?.txt, then commit all of the *.txt, *.txt.json and *.txt.html files and push them back to the master branch on GitHub. You should also answer the questions in your README.md file and push it to GitHub.