From KDE TechBase
Advanced Sparql Queries in Nepomuk
Tutorial Series   Nepomuk
Previous   Introduction to RDF and Ontologies, Nepomuk Server
Further Reading   SPARQL Quick Reference, SPARQL W3C Definition

In Nepomuk Server we learned how to access the Nepomuk Server to get a Soprano::Model instance. We will now take a look at how to perform queries against the Nepomuk data repository.

Note
The queries presented here are pretty low-level. Soon Nepomuk will provide a wrapper class that makes everyday queries much simpler.


Query Basics

Performing a query with Nepomuk/Soprano always follows the same pattern (the Soprano API documentation has more details on using the iterator):

QString query = getFancyQueryString();
Soprano::QueryResultIterator it
  = model->executeQuery( query,
                         Soprano::Query::QueryLanguageSparql );

while( it.next() ) {
  // a single binding, looked up by variable name
  Soprano::Node value = it.binding( "someVariableName" );
  // or all bindings of the current result at once
  Soprano::BindingSet allBindings = *it;
}


Simple Queries

Let us have a look at how a query can be constructed. As an example we will query for all resources that are tagged with a certain tag. Let's imagine that we have a reference to this tag stored in myTag. (Please ignore the fact that Nepomuk::Tag::tagOf essentially returns the same information. After all, we are here to learn how it works.)

#include <Soprano/Model>
#include <Soprano/QueryResultIterator>
#include <Soprano/Vocabulary/NAO>

[...]

Nepomuk::Tag myTag = getOurFancyTag();

QString query
  = QString("select distinct ?r where { ?r %1 %2 . }")
    .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::hasTag()) )
    .arg( Soprano::Node::resourceToN3(myTag.resourceUri()) );

Soprano::QueryResultIterator it
  = model->executeQuery( query,
                         Soprano::Query::QueryLanguageSparql );

// collect every resource bound to ?r
QList<Nepomuk::Resource> myResourceList;
while( it.next() ) {
  myResourceList << Nepomuk::Resource( it.binding( "r" ).uri() );
}

We begin by constructing the SPARQL query string. It is a simple query and if you know SQL it should be easy to understand. We select the resources that match the graph patterns in the where clause. In this case the resource needs to have the hasTag property with myTag as its object. As we can see, Soprano already provides a set of standard URIs as static instances in the Soprano::Vocabulary namespace. And since we have the Nepomuk resource object for the tag, we can simply use its unique URI to directly access the tagged resources.
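To make the resulting string concrete, here is a minimal sketch in plain standard C++ (no Soprano or Qt) of what the query above expands to. The helper names toN3Resource and buildTagQuery are made up for this illustration, and the tag URI in the usage note is hypothetical. In real code Soprano::Node::resourceToN3 does the bracket wrapping for us.

```cpp
#include <string>

// Wrap a URI in angle brackets, the N3/SPARQL notation for a resource.
// Assumption: the URI contains no '<', '>' or whitespace.
std::string toN3Resource(const std::string& uri) {
    return "<" + uri + ">";
}

// Build the "all resources that have this tag" query discussed above.
// Both parameters are plain URI strings.
std::string buildTagQuery(const std::string& hasTagUri,
                          const std::string& tagUri) {
    return "select distinct ?r where { ?r " + toN3Resource(hasTagUri)
         + " " + toN3Resource(tagUri) + " . }";
}
```

Calling buildTagQuery with the NAO hasTag URI and a tag URI such as nepomuk:/res/tag1 yields a single triple pattern in which only ?r is left open.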

But what if we do not have the tag URI but only its label, i.e. the name given by the user?

That is no problem with SPARQL either:

QString myTagLabel = getFancytagLabel();

QString query
  = QString("select distinct ?r where { "
            "?r %1 ?tag . "
            "?tag %2 %3 . }")
    .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::hasTag()) )
    .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::RDFS::label()) )
    .arg( Soprano::Node(myTagLabel).toN3() );

This already looks a lot more confusing than the previous example, but that is mainly due to the QString argument parameters. Let's clean it up a bit by using SPARQL prefix declarations:

QString query
  = QString("PREFIX nao: %1 "
            "PREFIX rdfs: %2 "
            "PREFIX xsd: %3 "
            "select distinct ?r where { "
            "?r nao:hasTag ?tag . "
            "?tag rdfs:label \"%4\"^^xsd:string . }")
    .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::naoNamespace()) )
    .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::RDFS::rdfsNamespace()) )
    .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::XMLSchema::xsdNamespace()) )
    .arg( myTagLabel );

Both queries are equivalent, and it is up to the query writer to decide which version to prefer. We present both versions here for demonstration purposes only.

Now let us analyse what is happening here. Instead of just matching a single graph pattern, we match two where the first one introduces another variable which is then reused in the second one. rdfs:label has a string literal range, meaning that each object related to a resource via the rdfs:label property is a string literal. And in this case we want to select the tag that has myTagLabel as its label.
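One detail worth keeping in mind when a user-supplied label is pasted between quotes by hand: a label that itself contains a quote or a backslash would break the query. The following is a minimal sketch in plain standard C++ of the kind of escaping needed. The function names are made up for this illustration, and only the most common escape sequences are handled.

```cpp
#include <string>

// Escape the characters that would otherwise end or corrupt a
// double-quoted SPARQL string literal.
std::string escapeSparqlString(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
        case '\\': out += "\\\\"; break;
        case '"':  out += "\\\""; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:   out += c;      break;
        }
    }
    return out;
}

// Produce a typed xsd:string literal in N3 notation.
std::string stringLiteralToN3(const std::string& text) {
    return "\"" + escapeSparqlString(text)
         + "\"^^<http://www.w3.org/2001/XMLSchema#string>";
}
```

The first version of the query, which goes through Soprano::Node(myTagLabel).toN3(), leaves the formatting of the literal to Soprano, so this only matters when the literal is assembled manually.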

Bringing more context into the mix

In Introduction to RDF and Ontologies we briefly learned about named graphs, or contexts, which make up the fourth part of each statement in Nepomuk. We can now use this information to filter our results based on creation dates. Imagine for example that we want to retrieve all resources tagged before the first of January 2008. We do this by introducing some more complex SPARQL syntax. For simplicity we go back to our first example of matching the tag URI directly, to keep the query readable. But of course both can be combined. (Keep in mind that we only use the prefix syntax here for readability. In actual code it may be better to directly add the URIs from Soprano::Vocabulary to prevent typing errors in property and class names.)

QDateTime firstOfJanuary = getFirstOfJanuary();

QString query
  = QString("PREFIX nao: <%1> "
            "PREFIX rdfs: <%2> "
            "select distinct ?r where { "
            "graph ?g { ?r nao:hasTag <%3> . } "
            "?g nao:created ?time . "
            "FILTER(?time < %4) . }")
    .arg( Soprano::Vocabulary::NAO::naoNamespace().toString() )
    .arg( Soprano::Vocabulary::RDFS::rdfsNamespace().toString() )
    .arg( myTag.resourceUri().toString() )
    .arg( Soprano::Node::literalToN3( firstOfJanuary ) );

This query contains three new concepts:

  1. As we can see, SPARQL does not simply add the context as a fourth parameter. Instead, the triples we want to match within a certain context have to be surrounded with the graph keyword.
  2. We use the SPARQL FILTER keyword to keep only those graphs/contexts that have a nao:created value earlier than the first of January.
  3. We do not format the QDateTime ourselves. Passing it to Soprano::Node::literalToN3 converts it to a Soprano::LiteralValue first. This is important since QDateTime does not support the RDF way of formatting a dateTime string. Thus, we need to use Soprano's internal dateTime string conversion algorithm by going through LiteralValue.
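To illustrate the third point, here is a minimal sketch in plain standard C++ of what a dateTime literal in N3 notation looks like. The function name and its parameters are made up for this illustration, and it assumes a UTC timestamp without fractional seconds. The exact string Soprano produces may differ in such details.

```cpp
#include <cstdio>
#include <string>

// Format a UTC timestamp as an xsd:dateTime typed literal in N3 notation.
// Assumption: all values are already in their valid ranges.
std::string dateTimeLiteralToN3(int year, int month, int day,
                                int hour, int minute, int second) {
    char buffer[32];
    // snprintf never writes past the buffer, even for odd input
    std::snprintf(buffer, sizeof(buffer), "%04d-%02d-%02dT%02d:%02d:%02dZ",
                  year, month, day, hour, minute, second);
    return std::string("\"") + buffer
         + "\"^^<http://www.w3.org/2001/XMLSchema#dateTime>";
}
```

A literal of this shape can be compared with < inside a FILTER, which is what the query above relies on.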


Full text queries

As of KDE 4.4 Nepomuk depends on Virtuoso for data storage. Virtuoso brings a lot of nice extensions to SPARQL, most importantly full text search, which is used through the artificial bif:contains property.

This makes it possible to combine graph queries with full text queries in a nice way:

select ?r where { ?r nao:prefLabel ?label .
                  ?label bif:contains 'nepomuk' . }

The query above will find any resources that contain nepomuk in their label.

Of course wildcards are supported, too. However, be aware that when using wildcards the expression itself needs to be enclosed in quotes as follows:

select ?r where { ?r nao:prefLabel ?label .
                  ?label bif:contains "'nepomuk*'" . }
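The two quoting styles are easy to mix up when the search word comes from a variable. The following is a minimal sketch in plain standard C++ that assembles either form. The function names are made up for this illustration, and the sketch assumes a single search word without quotes or whitespace.

```cpp
#include <string>

// Build the object of a bif:contains pattern for a single search word.
// With a trailing wildcard the expression is wrapped in single quotes
// inside a double-quoted literal, as in the second query above.
std::string bifContainsPattern(const std::string& word, bool prefixMatch) {
    if (prefixMatch) {
        return "\"'" + word + "*'\"";
    }
    return "'" + word + "'";
}

// Assemble the label search query shown above around that pattern.
std::string buildLabelSearchQuery(const std::string& word, bool prefixMatch) {
    return "select ?r where { ?r nao:prefLabel ?label . "
           "?label bif:contains "
         + bifContainsPattern(word, prefixMatch) + " . }";
}
```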

For most simple queries (simple queries do not use any back-referencing for example) the Nepomuk desktop query API should be sufficient.


Full text queries before KDE 4.4

While SPARQL in theory supports full text queries through the FILTER and REGEX keywords, the storage backends do not have a real full text index of their own. Thus, a full text search using SPARQL FILTER may become very slow if there are many statements to filter.

That is why Soprano has the CLucene based full text index model. It is stacked on top of the actual storage model within the Nepomuk Server and provides a full text index on all literal object nodes in the repository. Since Soprano does not have a fancy query API yet (using plain strings as queries does not count as fancy), full text queries still have to be performed separately. This may be inconvenient but will hopefully be solved in Soprano 3.

So for now we have to learn a second way to query the repository: using the Lucene Query Language. But that is much easier in most cases.

Let us assume that we want to search for resources that are related to some literal object that matches the value "nepomuk". In SPARQL the query would look like this:

select ?r where { ?r ?p ?o .
                  FILTER REGEX(STR(?o), 'nepomuk', 'i') .
                  FILTER isLiteral(?o) . }

We convert the object literal into a string and match it to a regular expression ignoring case. This works but may be slow. Using the Soprano lucene full text index we perform this query as follows:

Soprano::QueryResultIterator it
  = model->executeQuery( "nepomuk",
                         Soprano::Query::QueryLanguageUser,
                         "lucene" );

while( it.next() ) {
  QUrl resource = it.binding( "resource" ).uri();
  double score = it.binding( "score" ).literal().toDouble();
}

Here we make use of the fact that Soprano allows adding new user-defined query languages. We also rely on the fixed mapping from CLucene query results to Soprano query bindings as defined in Soprano::Index::IndexFilterModel:

  • Binding resource always gives the matched resource.
  • Binding score always gives the lucene score (between 0 and 1).

These results can now be reused to perform further SPARQL queries.
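Because the search string is parsed as a Lucene query, a word typed by the user may contain characters that have a special meaning to the parser. The following is a minimal sketch in plain standard C++ that escapes them with a backslash. The function name is made up for this illustration, and the list of special characters is taken from the classic Lucene query syntax, assuming that CLucene follows it.

```cpp
#include <string>

// Prefix every character that is special in the classic Lucene query
// syntax with a backslash, so the term is matched literally.
std::string escapeLuceneTerm(const std::string& term) {
    static const std::string special = "+-&|!(){}[]^\"~*?:\\";
    std::string out;
    out.reserve(term.size());
    for (char c : term) {
        if (special.find(c) != std::string::npos) {
            out += '\\';
        }
        out += c;
    }
    return out;
}
```

The escaped term can then be passed to executeQuery in place of the hard-coded "nepomuk" string. Wildcards that are meant as wildcards must of course be appended after escaping.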

Of course it is possible to use the full range of the Lucene Query Language. Another simple example would be to only match a certain property:

QString query
  = Soprano::Vocabulary::RDFS::label().toString()
    + ':' + "nepomuk";

Soprano::QueryResultIterator it
  = model->executeQuery( query,
                         Soprano::Query::QueryLanguageUser,
                         "lucene" );