UMBC CMSC 491/691 Fall 2022
Knowledge Graphs


Homework 2: Writing SPARQL Queries

out 2022-10-27, due 2022-11-13

get your repo link

This homework will give you some experience in using the RDF query language SPARQL to access data in DBpedia.

SPARQL is a relatively straightforward query language for RDF data that will seem familiar to people who know SQL. It does have a number of unique features that stem from the RDF model. What is different, though, is how we have to approach querying large semi-structured collections like DBpedia and Wikidata. A typical relational database has a small number of tables, each with a fixed number of columns, and every row in a table has a value for each column. The database schema is thus uniform and relatively small.

The most recent English DBpedia snapshot has about 1B facts supported by an ontology. You can read its documentation and visualizations, look at its class tree and properties, and see its wiki on how to modify it. Wikidata is much larger and in some ways more sophisticated and complicated, and is an important resource for many applications.

In querying DBpedia you need to know the URIs of any subjects, predicates, or objects that are part of your query. For example, if you want to know what schools Barack Obama attended, you must know how this information is represented: the URI for President Obama, the URI for the predicate or predicates that represent the relationship "attended a school", and possibly the type(s) for schools. You can easily find URIs and discover relevant properties by accessing DBpedia or Wikidata in a Web browser, starting at an appropriate page. Use a conventional search engine to find an entry point. If you are looking for the page on Barack Obama, you might search for DBpedia Barack Obama and the appropriate page will probably be one of the first few hits returned. Once you are browsing a DBpedia page, you can follow its links to other pages.

The standard public SPARQL endpoint for DBpedia is http://dbpedia.org/sparql. Visiting this in your browser takes you to a page in which you can enter and execute a SPARQL query via a public system offered by OpenLink Software. Their triple store technology is very good, has both commercial and open source versions, and supports most (but not all) of the SPARQL 1.1 standard (see here for some examples).
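Behind the web form, the endpoint is just a URL that takes the query as a parameter. Here is a minimal sketch of how such a request URL can be built with the Python standard library. The query parameter comes from the SPARQL protocol; the format parameter is our assumption about what the DBpedia server accepts for requesting JSON, and build_query_url is a name made up for this example.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"


def build_query_url(query, endpoint=ENDPOINT,
                    result_format="application/sparql-results+json"):
    """Return a GET URL that asks the endpoint to run the given query.

    The 'format' parameter is an assumption about the server; the
    'query' parameter is the one defined by the SPARQL protocol.
    """
    params = {"query": query, "format": result_format}
    return endpoint + "?" + urlencode(params)


url = build_query_url("SELECT ?P WHERE {dbr:Donald_Trump dbo:spouse ?P}")
print(url)
```

This only builds the URL and does not contact the server, so it is safe to run anywhere.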

Getting started

Clone the HW2 Git repository. The directory has a stub for each of the queries you have to write, with a name like q??.txt for a DBpedia query. Edit each stub to be a working query, verifying that it works using the appropriate web-based SPARQL client (i.e., http://dbpedia.org/sparql or yasgui), or just use the Python script sparql.py to run the query and produce output files.

The sparql.py script requires the SPARQLWrapper package, which can be easily installed with pip. If you are running on your own computer, you should be able to install it with the following command:

pip install sparqlwrapper

If you are using gl or some other computer where you cannot install Python packages centrally, you can install a local version with the following command:

pip install --user sparqlwrapper
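
If you are not sure whether the install worked, a quick check like the following sketch can tell you before you run sparql.py. It assumes the package is imported under the module name SPARQLWrapper; has_sparqlwrapper is a helper name invented for this example.

```python
import importlib.util


def has_sparqlwrapper():
    """Return True if the SPARQLWrapper module can be found by Python."""
    return importlib.util.find_spec("SPARQLWrapper") is not None


if has_sparqlwrapper():
    print("SPARQLWrapper is installed")
else:
    print("SPARQLWrapper is missing; try: pip install --user sparqlwrapper")
```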

Example DBpedia queries

Four sample queries are shown below. For each there is a link to the query text, the results of running it as JSON and a simple table, and a link to the query in the yasgui SPARQL interface. If you encounter a problem using yasgui, make sure that the URL for the SPARQL server starts with HTTPS and not HTTP.

Query qex1 (example json, result) (open in yasgui)

We've done this one as an example; the query below is the contents of the file qex1.txt.

# Who are Donald Trump's spouses?
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?P WHERE {dbr:Donald_Trump dbo:spouse ?P}

The dbr namespace is used to refer to DBpedia things (e.g., Obama, UMBC), dbo to refer to DBpedia ontology terms (e.g., Person, child), and dbp to DBpedia properties that come from Wikipedia infoboxes (e.g., profession). When we execute the command python sparql.py qex1.txt, the files qex1.txt.json and qex1.txt.html are produced. The first is an encoding of the server's response as a JSON object (easy for programs to use) and the second is an encoding as a simple HTML table with URIs linked to an appropriate DBpedia page (good for humans to browse).
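
To see why the JSON form is easy for programs to use, here is a small sketch that turns a SELECT response into a list of rows. It assumes the response follows the standard SPARQL JSON results layout (a head with the variable names and a list of bindings). The sample text is made up for illustration and is not real output from DBpedia; rows_from_select is a hypothetical helper name.

```python
import json


def rows_from_select(response):
    """Turn a SPARQL JSON SELECT response into a list of {variable: value} dicts."""
    variables = response["head"]["vars"]
    rows = []
    for binding in response["results"]["bindings"]:
        # A variable that is unbound in a solution is simply absent from
        # the binding, so we record it as None.
        rows.append({v: (binding[v]["value"] if v in binding else None)
                     for v in variables})
    return rows


# Hypothetical response text shaped like the standard format.
sample_text = """
{"head": {"vars": ["P"]},
 "results": {"bindings": [
   {"P": {"type": "uri", "value": "http://dbpedia.org/resource/Example_Person"}}
 ]}}
"""

rows = rows_from_select(json.loads(sample_text))
print(rows)
```

If the .json file written by sparql.py has this layout, which we assume but have not shown here, you could load it with json.load and pass the result to the same function.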

Note that a knowledge graph and/or its endpoint typically predefines short prefixes for the most common URI namespaces used in its data. The full list of prefixes supported by the standard DBpedia SPARQL endpoint is here and includes dbo, dbr, and dbp. So you do not need to define those prefixes in your queries, though doing so may mean they are used in the output.
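
A prefix is only an abbreviation: the endpoint replaces it with the full namespace before matching. The sketch below shows that expansion for the three prefixes mentioned above. The expand function and the PREFIXES table are names made up for this example, not part of sparql.py.

```python
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
    "dbp": "http://dbpedia.org/property/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name like 'dbr:Donald_Trump' into a full URI."""
    prefix, _, local = name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix}")
    return prefixes[prefix] + local


print(expand("dbr:Donald_Trump"))
print(expand("dbo:spouse"))
```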

Query qex2 (example json, result) (open in yasgui)

# Who are Donald Trump's children and what schools did they attend?
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?child ?school WHERE {
dbr:Donald_Trump dbo:child ?child.
?child dbo:almaMater ?school. }
The results for qex2 were obviously not good. Looking at more examples of how DBpedia links a person to a school they attended shows that several properties are used, depending on whether the link comes from a Wikipedia infobox or the article text. A second issue seems to be that there is variation in the URIs used for the properties. Here's a better query that fixes these problems. The query uses the vertical bar (|) to denote that any of the four properties given satisfy the triple pattern.

Query qex2a (example json, open in yasgui)

# Who are Donald Trump's children and what schools did they attend?
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?child ?school WHERE {
dbr:Donald_Trump dbo:child ?child.
?child dbo:almaMater|dbp:almaMater|dbo:education|dbp:education ?school. }
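
If you end up trying many combinations of properties, it can help to build the alternative path from a list rather than retyping it. This is only a convenience sketch; alternative_path is a hypothetical helper, and the list holds the same four properties used in qex2a.

```python
def alternative_path(properties):
    """Join prefixed property names with '|' to form an alternative path."""
    if not properties:
        raise ValueError("need at least one property")
    return "|".join(properties)


SCHOOL_PROPERTIES = ["dbo:almaMater", "dbp:almaMater",
                     "dbo:education", "dbp:education"]

pattern = f"?child {alternative_path(SCHOOL_PROPERTIES)} ?school ."
print(pattern)
```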

The next example is a simple extension of query qex1. Recognizing who is a U.S. President will take some investigation. The native DBpedia ontology has a relatively small set of about 1000 types and does not have a type for "US Presidents". However, DBpedia also uses types from YAGO, which has on the order of 100K fine-grained types. So, look at the rdf:type values for a president in dbpedia.org and see if you can find a good type to use. All you need to do is add another triple pattern. I found that the best way to identify a U.S. President was the intersection of yago:WikicatPresidentsOfTheUnitedStates and dbo:Person. Wikipedia categories, like Presidents of the United States, refer to pages for presidents but also have links to subcategories about U.S. Presidents. One other minor problem was that some people have schools that are an empty string, so these are filtered out.

Query qex3 (example json, result) (open in yasgui)

# Show US presidents and their children who attended the same school
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX yago: <http://dbpedia.org/class/yago/>
SELECT DISTINCT ?president ?child  ?school WHERE
  {?president a yago:WikicatPresidentsOfTheUnitedStates,
                dbo:Person;
         dbo:child ?child;
         dbo:almaMater|dbp:almaMater|dbo:education|dbp:education ?school.
    ?child dbo:almaMater|dbp:almaMater|dbo:education|dbp:education ?school.
    FILTER (?school != ""@en) }

DBpedia query problems to solve

For each of the following eight queries, write a SPARQL query in your repo's file qxx.txt that produces a reasonable answer. We've provided answers that we thought were reasonable as both a JSON object and a simple HTML table. Your answers might differ a bit, but try to be close. You might, in fact, get a better answer. You can generate both the JSON and table results by calling the sparql.py script with an argument that is the name of your query file, e.g. q00.txt.

Query q00 (example json, result)

Find rivers in Maryland

This one is easy, but you will have to determine the IRI to use for Maryland, the appropriate type (or types!) to recognize that something is a river or a stream and the right property or properties that link a river to a state. A good approach is to sample a few rivers (e.g., Mississippi, Patapsco, Linganore Creek) and see how their data is represented.

Query q01 (example json, result)

Find rivers in Maryland and (optionally) their lengths

This is a simple modification of q00 that requires the use of SPARQL's OPTIONAL feature so that we return rivers even if the length is not available.
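
If you know SQL, OPTIONAL behaves much like a left outer join: every match of the required pattern is kept, and the optional variable is left unbound when nothing matches. Here is a rough sketch of that behavior on made-up data. The river names and numbers are hypothetical, and left_join is a name invented for this example.

```python
def left_join(required, optional):
    """Keep every required item; attach its optional value, or None if missing."""
    return [(item, optional.get(item)) for item in required]


# Hypothetical data: three rivers, only two of which have a known length.
rivers = ["River_A", "River_B", "River_C"]
lengths = {"River_A": 120000.0, "River_C": 63000.0}

rows = left_join(rivers, lengths)
for river, length in rows:
    print(river, length)
```

Without OPTIONAL, River_B would drop out of the results entirely, which is what happens if you put the length pattern directly in the WHERE clause.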

Query q02 (example json, result)

Find rivers in Maryland and (optionally) their lengths, ordered from longest to shortest

This is another simple extension of the previous query that requires an ORDER BY clause.
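
One thing to think about is where rivers without a length end up. In the standard SPARQL ordering an unbound value sorts lowest, so a descending sort should put those rivers last. The sketch below mimics that on hypothetical data; it is an illustration of the ordering, not of how the endpoint implements it.

```python
def order_longest_first(rows):
    """Sort (river, length) rows by length descending, unbound lengths last."""
    # The first key element pushes rows with no length to the end; the
    # second sorts the remaining rows from longest to shortest.
    return sorted(rows, key=lambda row: (row[1] is None, -(row[1] or 0.0)))


# Hypothetical rows, as they might come back from an OPTIONAL query.
rows = [("River_A", 120000.0), ("River_B", None),
        ("River_C", 63000.0), ("River_D", 150000.0)]

ordered = order_longest_first(rows)
print([river for river, _ in ordered])
```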

Query q03 (example json, result)

Is there a US state that has no river?

Write this as an ASK query, which will return either True or False. You will need to use FILTER and NOT EXISTS. I thought it would be easy, but had a difficult time finding a way to get the 50 U.S. states. Here's what I came up with for that. Feel free to use this as part of your solution to this query and the next. You can see all of the relevant prefixes for DBpedia here.

  SELECT ?state {
       ?state dct:subject dbc:States_of_the_United_States.
       FILTER NOT EXISTS {?state a yago:WikicatSubdivisionsOfTheUnitedStates.}
       FILTER (?state != dbr:Admission_to_the_Union)
  }
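
To get a feel for the logic before writing the query, it may help to think of it in plain Python. An ASK query is true when at least one solution exists, and FILTER NOT EXISTS keeps a candidate only when the inner pattern has no match. The sketch below applies that idea to made-up states and rivers; all names are hypothetical.

```python
def ask_state_without_river(states, states_with_rivers):
    """Return True if at least one state has no river, like ASK + FILTER NOT EXISTS."""
    return any(state not in states_with_rivers for state in states)


# Hypothetical data: State_B has no river linked to it.
states = {"State_A", "State_B", "State_C"}
river_in = [("River_1", "State_A"), ("River_2", "State_A"), ("River_3", "State_C")]
states_with_rivers = {state for _, state in river_in}

answer = ask_state_without_river(states, states_with_rivers)
print(answer)
```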

Query q04 (example json, result)

Find the number of rivers in each US state, ordered by state

This query will require using GROUP BY and COUNT.
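
GROUP BY with COUNT works like tallying rows per key. Here is a small sketch of that on hypothetical (river, state) pairs, counting each river once per state in the way a distinct count would. The function name and data are made up for this example.

```python
from collections import Counter


def count_by_state(river_state_pairs):
    """Count distinct rivers per state and return (state, count) sorted by state."""
    # Using a set drops duplicate pairs, which can appear when the same
    # river matches through more than one property.
    counts = Counter(state for _, state in set(river_state_pairs))
    return sorted(counts.items())


pairs = [("River_1", "State_A"), ("River_2", "State_A"),
         ("River_3", "State_B"), ("River_1", "State_A")]

result = count_by_state(pairs)
print(result)
```

Note that a state with no rivers never shows up in a tally like this, so think about whether your query's answer should include such states.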

Query q05 (example json, result)

Find rivers or streams in Maryland that directly empty into the Chesapeake Bay

Rivers empty into another body of water, such as an ocean, a lake, a bay, or another river. Examine some rivers in DBpedia (e.g., Mississippi, Patapsco, Linganore Creek) and see how DBpedia represents this. Then find all of the rivers or streams whose water goes directly to the Chesapeake Bay.

Query q06 (example json, result)

Find rivers or streams in Maryland whose waters eventually end up in the Chesapeake Bay

Water from a stream might empty into a small river that empties into a larger river that empties into the Bay. You should identify all of the rivers or streams whose water eventually ends up in the Bay.
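
What is being asked for here is a transitive closure: follow the "empties into" link one or more times until you reach the Bay. SPARQL 1.1 property paths have operators for repeating a property, which is one way to express this in a query. The sketch below shows the same idea in Python on a made-up graph; the names are hypothetical and drains_into is a helper invented for this example.

```python
def drains_into(target, flows_into):
    """Return every body of water whose water eventually reaches the target.

    flows_into maps each river to the body of water it empties into.
    """
    # Invert the edges so we can walk upstream from the target.
    upstream = {}
    for source, destination in flows_into.items():
        upstream.setdefault(destination, set()).add(source)

    found = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for source in upstream.get(current, ()):
            if source not in found:  # also guards against cycles in the data
                found.add(source)
                stack.append(source)
    return found


# Hypothetical data: a creek feeds a river that feeds another river into the Bay.
flows_into = {
    "Creek_A": "River_B",
    "River_B": "River_C",
    "River_C": "Bay",
    "River_D": "Ocean",
}

result = drains_into("Bay", flows_into)
print(sorted(result))
```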

Query q07 (example json, result)

Find rivers or streams in any state whose waters eventually end up in the Chesapeake Bay

Rivers or streams from other states (e.g., Delaware) can eventually end up in the Bay.

Writing your queries

Write each of the queries in a separate file, with names from q00.txt to q07.txt. You can debug your queries using a web-based SPARQL client like yasgui or the native DBpedia web service. You should eventually use the simple Python script sparql.py, which takes one or more query files, runs them through the endpoint, and writes out the response into files as both JSON and HTML. You can use a command line statement like

	  python sparql.py q??.txt
to run all of your queries and create the .json and .html files. You can then compare your output to that produced by our model solutions by looking at the JSON and HTML files.
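
Comparing by eye gets tedious for long results, so here is a sketch of a comparison you could script. It assumes both responses use the standard SPARQL JSON results layout and that your query uses the same variables in the same order as the model solution, which may not be true. The helper names and the sample data are made up.

```python
def result_rows(response):
    """Return the set of rows (tuples of values) in a SPARQL JSON SELECT response."""
    variables = response["head"]["vars"]
    return {tuple(binding.get(v, {}).get("value") for v in variables)
            for binding in response["results"]["bindings"]}


def compare(mine, model):
    """Report rows missing from my results and extra rows not in the model's."""
    mine_rows, model_rows = result_rows(mine), result_rows(model)
    return {"missing": model_rows - mine_rows, "extra": mine_rows - model_rows}


def make_response(values):
    """Build a hypothetical one-variable response for the demonstration."""
    return {"head": {"vars": ["river"]},
            "results": {"bindings": [{"river": {"type": "uri", "value": v}}
                                     for v in values]}}


mine = make_response(["River_A", "River_B"])
model = make_response(["River_B", "River_C"])

diff = compare(mine, model)
print(diff)
```

A few extra or missing rows are expected, since answers can reasonably differ from the model solution.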

Note: be sure to direct your DBpedia queries to the endpoint https://dbpedia.org/sparql or to Yasgui (with its endpoint set to https://dbpedia.org/sparql).

What and how to submit

Get the files from your HW2 Git repository by cloning it either on your own computer (if you have Python installed on it) or on your gl account. Stubs for the queries are in files q0?.txt. Edit each of these to be a working query. You can debug your query using one of the public Web clients (see above) or by using the sparql.py program. The sample output includes the answers our model solution provides. Your answers might vary a bit, but hopefully not too much.

When you are done, you can rerun all of your queries with a command like python sparql.py q??.txt, then commit all of the *.txt, *.txt.json and *.txt.html files and push them back to the master branch on GitHub. You should also answer the questions in your README.md file and push it to GitHub.