Development/Tutorials/Metadata/Nepomuk/AdvancedQueries: Difference between revisions

From KDE TechBase
(typo)
(→‎Full text queries before KDE 4.4: Remove it - totally useless)
 
(18 intermediate revisions by 8 users not shown)
Line 1: Line 1:
{{TutorialBrowser|
{{TutorialBrowser|
series=Nepomuk|
series=[[../|Nepomuk]]|
name=Advanced Sparql Queries in Nepoumuk|
name=Advanced Sparql Queries in Nepomuk|
pre=[[../RDFIntroduction|Introduction to RDF and Ontologies]], [[../NepomukServer|Nepomuk Server]]|
pre=[[../RDFIntroduction|Introduction to RDF and Ontologies]], [[../NepomukServer|Nepomuk Server]]|
next=|
next=|
Line 7: Line 7:
}}
}}


==Advanced Sparql Queries in Nepoumuk==
==Advanced Sparql Queries in Nepomuk==


In [[../NepomukServer|Nepomuk Server]] we learned how to access the Nepomuk Server to get a [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/classSoprano_1_1Model.html Soprano::Model] instance. We will now take a look at how to perform queries against the Nepomuk data repository.
We will now take a look at how to perform queries against the Nepomuk data repository.


{{Note|The queries presented here a pretty low-level. Soon Nepomuk will provide a wrapper class that makes every-day queries much simpler.}}
{{Note|The queries presented here a pretty low-level. Only use this approach if the [[../NepomukQuery|Nepomuk Query API]] does not fulfill your needs.}}
 
===The Main Model===
 
Nepomuk uses one main [http://soprano.sourceforge.net/apidox/stable/classSoprano_1_1Model.html Soprano model] which is accessed through the [http://api.kde.org/4.x-api/kdelibs-apidocs/nepomuk/html/classNepomuk_1_1ResourceManager.html ResourceManager]:
 
<syntaxhighlight lang="cpp-qt">
Soprano::Model* model = Nepomuk::ResourceManager::instance()->mainModel();
</syntaxhighlight>


===Query Basics===
===Query Basics===
Line 17: Line 25:
Basically performing a query with Nepomuk/Soprano always looks as follows (More details on using the iterator in the [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/classSoprano_1_1QueryResultIterator.html Soprano API documentation].):
Basically performing a query with Nepomuk/Soprano always looks as follows (More details on using the iterator in the [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/classSoprano_1_1QueryResultIterator.html Soprano API documentation].):


<code cppqt>
<syntaxhighlight lang="cpp-qt">
QString query = getFancyQueryString();
QString query = getFancyQueryString();
Soprano::QueryResultIterator it
Soprano::QueryResultIterator it
Line 26: Line 34:
   Soprano::BindingSet allBindings = *it;
   Soprano::BindingSet allBindings = *it;
}
}
</code>
</syntaxhighlight>




Line 33: Line 41:
Let us have a look at how a query can be constructed. As an example we will query for all resources that are tagged with a certain tag. Let's imagine that we have a reference to this tag stored in ''myTag''. (Please ignore the fact that Nepomuk::Tag::tagOf essentially returns the same information. After all, we are here to learn how it works.)
Let us have a look at how a query can be constructed. As an example we will query for all resources that are tagged with a certain tag. Let's imagine that we have a reference to this tag stored in ''myTag''. (Please ignore the fact that Nepomuk::Tag::tagOf essentially returns the same information. After all, we are here to learn how it works.)


<code cppqt>
<syntaxhighlight lang="cpp-qt">
#include <Soprano/Model>
#include <Soprano/Model>
#include <Soprano/QueryResultIterator>
#include <Soprano/QueryResultIterator>
Line 43: Line 51:


QString query
QString query
   = QString("select distinct ?r where { "?r <%1> <%2> . }")
   = QString("select distinct ?r where { ?r %1 %2 . }")
     .arg( Soprano::Vocabulary::NAO::hasTag().toString() )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::hasTag()) )
     .arg( myTag.resourceUri().toString() );
     .arg( Soprano::Node::resourceToN3(myTag.resourceUri()) );


Soprano::QueryResultIterator it
Soprano::QueryResultIterator it
Line 51: Line 59:
                           Soprano::Query::QueryLanguageSparql );
                           Soprano::Query::QueryLanguageSparql );
while( it.next() ) {
while( it.next() ) {
   myResourceList << Nepomuk::Resource( it.binding( "r" ) );
   myResourceList << Nepomuk::Resource( it.binding( "r" ).uri() );
}
}
</code>
</syntaxhighlight>


We begin by constructing the SPARQL query string. It is a simple query and if you know SQL it should be easy to understand. Basically we select resources that match the patterns in the ''where'' statement. In this case the resource needs to have the ''hasTag'' property with object ''myTag''. As we can see, Soprano already provides a set of standard URIs as static instances in the [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/namespaceSoprano_1_1Vocabulary.html Soprano::Vocabulary] namespace. And since we have the Nepomuk resource object for the tag we can simply use its unique URI to directly access the tagged resources.
We begin by constructing the SPARQL query string. It is a simple query and if you know SQL it should be easy to understand. Basically we select resources that match the patterns in the ''where'' statement. In this case the resource needs to have the ''hasTag'' property with object ''myTag''. As we can see, Soprano already provides a set of standard URIs as static instances in the [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/namespaceSoprano_1_1Vocabulary.html Soprano::Vocabulary] namespace. And since we have the Nepomuk resource object for the tag we can simply use its unique URI to directly access the tagged resources.
Line 61: Line 69:
Also no problem with SPARQL:
Also no problem with SPARQL:


<code cppqt>
<syntaxhighlight lang="cpp-qt">
QString myTagLabel = getFancytagLabel();
QString myTagLabel = getFancytagLabel();


QString query
QString query
   = QString("select distinct ?r where { "
   = QString("select distinct ?r where { "
             "?r <%1> ?tag . "
             "?r %1 ?tag . "
             "?tag <%2> \"%3\"^^<%4> . }")
             "?tag %2 %3 . }")
     .arg( Soprano::Vocabulary::NAO::hasTag().toString() )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::hasTag()) )
     .arg( Soprano::Vocabulary::RDFS::label().toString() )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::RDFS::label()) )
     .arg( myTagLabel )
     .arg( Soprano::Node(myTagLabel).toN3() );
    .arg( Soprano::Vocabulary::XMLSchema::string() );
</syntaxhighlight>
</code>


This already looks a lot more confusing as the previous example but that is mainly due to the QString argument paramters. Let's clean it up w bit by using SPARQL prefix declarations:
This already looks a lot more confusing as the previous example but that is mainly due to the QString argument paramters. Let's clean it up w bit by using SPARQL prefix declarations:


<code cppqt>
<syntaxhighlight lang="cpp-qt">
QString query
QString query
   = QString("PREFIX nao: <%1> "
   = QString("PREFIX nao: %1 "
             "PREFIX rdfs: <%2> "
             "PREFIX rdfs: %2 "
             "PREFIX xls: <%3> "
             "PREFIX xls: %3 "
             "select distinct ?r where { "
             "select distinct ?r where { "
             "?r nao:hasTag ?tag . "
             "?r nao:hasTag ?tag . "
             "?tag rdfs:label \"%4\"^^xls:string . }")
             "?tag rdfs:label \"%4\"^^xls:string . }")
     .arg( Soprano::Vocabulary::NAO::naoNamespace().toString() )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::naoNamespace()) )
     .arg( Soprano::Vocabulary::RDFS::rdfsNamespace().toString() )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::RDFS::rdfsNamespace()) )
     .arg( Soprano::Vocabulary::XMLSchema::xlsNamespace() )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::XMLSchema::xsdNamespace()) )
     .arg( myTagLabel );
     .arg( myTagLabel );
</code>
</syntaxhighlight>


Both queries are the same and it is up to the query writer to decide which version he or she prefers. We are just presenting both versions here for demonstration purposes.
Both queries are the same and it is up to the query writer to decide which version he or she prefers. We are just presenting both versions here for demonstration purposes.


Now let us analyse what is happening here. Instead of just matching a single graph pattern, we match two where the first one introduces another variable which is then reused in the second one. ''rdfs:label'' has a string literal range, meaning that each object related to a resource via the ''rdfs:label'' property is a string literal. And in this case we want to select the tag that has ''myTagLabel'' as its label.
Now let us analyse what is happening here. Instead of just matching a single graph pattern, we match two where the first one introduces another variable which is then reused in the second one. ''rdfs:label'' has a string literal range, meaning that each object related to a resource via the ''rdfs:label'' property is a string literal. And in this case we want to select the tag that has ''myTagLabel'' as its label.


===Bringing more context into the mix===
===Bringing more context into the mix===
Line 99: Line 105:
In [[../RDFIntroduction|Introduction to RDF and Ontologies]] we briefly learned about ''named graphs'' or ''context'' which make up the fourth part of each statement in Nepomuk. We can now use this information to filter our results based on creation dates. Imagine for example that we want to retrieve all resources tagged before the first of January 2008. We do this by introducing some more complex SPARQL syntax. For simplicity we go back to our first example of matching the tag URI directly to keep the query from getting too unreadable. But of course both can be combined. (Keep in mind that we only use the prefix syntax here for readability. In actual code it may be better to directly add the URIs from [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/namespaceSoprano_1_1Vocabulary.html Soprano::Vocabulary] to prevent typing errors in property and class names.)
In [[../RDFIntroduction|Introduction to RDF and Ontologies]] we briefly learned about ''named graphs'' or ''context'' which make up the fourth part of each statement in Nepomuk. We can now use this information to filter our results based on creation dates. Imagine for example that we want to retrieve all resources tagged before the first of January 2008. We do this by introducing some more complex SPARQL syntax. For simplicity we go back to our first example of matching the tag URI directly to keep the query from getting too unreadable. But of course both can be combined. (Keep in mind that we only use the prefix syntax here for readability. In actual code it may be better to directly add the URIs from [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/namespaceSoprano_1_1Vocabulary.html Soprano::Vocabulary] to prevent typing errors in property and class names.)


<code cppqt>
<syntaxhighlight lang="cpp-qt">
QDateTime firstOfJanuary = getFirstOfJanuary();
QDateTime firstOfJanuary = getFirstOfJanuary();


Line 105: Line 111:
   = QString("PREFIX nao: <%1> "
   = QString("PREFIX nao: <%1> "
             "PREFIX rdfs: <%2> "
             "PREFIX rdfs: <%2> "
            "PREFIX xls: <%3> "
             "select distinct ?r where { "
             "select distinct ?r where { "
             "graph ?g { ?r nao:hasTag <%4> . } "
             "graph ?g { ?r nao:hasTag <%3> . } "
             "?g nao:created ?time . "
             "?g nao:created ?time . "
             "FILTER(?time < \"%5\"^^xls:dateTime) . }")
             "FILTER(?time < %4) . }")
     .arg( Soprano::Vocabulary::NAO::naoNamespace().toString() )
     .arg( Soprano::Vocabulary::NAO::naoNamespace().toString() )
     .arg( Soprano::Vocabulary::RDFS::rdfsNamespace().toString() )
     .arg( Soprano::Vocabulary::RDFS::rdfsNamespace().toString() )
    .arg( Soprano::Vocabulary::XMLSchema::xlsNamespace() )
     .arg( myTag.resourceUri().toString() )
     .arg( myTag.resourceUri().toString() )
     .arg( Soprano::LiteralValue( firstOfJanuary ).toString() );
     .arg( Soprano::Node::literalToN3( firstOfJanuary ) );
</code>
</syntaxhighlight>


This query contains three new concepts:
This query contains three new concepts:
Line 124: Line 128:




===Full text queries===
=== Full text queries ===


While SPARQL in theory supports full text queries through the [http://www.w3.org/TR/rdf-sparql-query/#funcex-regex ''REGEX FILTER''] keywords the storage backends do not have their own real full text index. Thus, a full text search using SPARQL FILTER may become very slow if there are many statements to filter.
With KDE 4.4 Nepomuk depends on [http://soprano.sourceforge.net/apidox/trunk/soprano_backend_virtuoso.html Virtuoso] for data storage. Virtuoso brings a lot of nice [http://docs.openlinksw.com/virtuoso/rdfsparql.html#sparqlextensions extensions to SPARQL]. Most importantly the full text search which is used through the artificial ''bif:contains'' property.  


That is why in Soprano we have the [http://clucene.sourceforge.net/ CLucene] based [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/namespaceSoprano_1_1Index.html full text index model]. It is stacked on top of the actual storage model within the Nepomuk Server and provides a full text index on all literal object nodes in the repository. Since Soprano does not have a fancy query API yet (using plain strings as queries does not count as ''fancy'') full text queries have still to be performed separately. This may be inconvenient but will hopefully be solved in Soprano 3.
This allows to combine graph queries with full text queries in a nice way:
 
So for now we have to learn a second way to query the repository: using the [http://lucene.apache.org/java/docs/queryparsersyntax.html Lucene Query Language]. But that is much easier in most cases.
 
Let us assume that we want to search resources that are related to some literal object that matches the value "nepomuk". In SPARQL this would mean to query for:
 
<code>
select ?r where { ?r ?p ?o .
                  FILTER REGEX(STR(?o),'nepomuk', 'i') .
                  FILTER isLiteral(?o) . }
</code>
 
We convert the object literal into a string and match it to a regular expression ignoring case. This works but may be slow. Using the Soprano lucene full text index we perform this query as follows:
 
<code cppqt>
Soprano::QueryResultIterator it =
  model->executeQuery( "nepomuk",
                        Soprano::Query::QueryLanguageUser,
                        "lucene" );
while( it.next() ) {
  QUrl resource = it.binding( "resource" ).uri();
  double score = it.binding( "score" ).literal().toDouble();
}
</code>


Here we make use of the fact that Soprano allows to add new [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/namespaceSoprano_1_1Query.http user defined query languages]. Also we use the fixed mapping from CLucene query results to Soprano query bindings as defined in  [http://api.kde.org/kdesupport-api/kdesupport-apidocs/soprano/html/classSoprano_1_1Index_1_1IndexFilterModel.html Soprano::Index::IndexFilterModel]:
<syntaxhighlight lang="text">
select ?r where { ?r nao:prefLabel ?label .
                  ?label bif:contains 'nepomuk' . }
</syntaxhighlight>


* Binding ''resource'' always gives the matched resource.
The query above will find any resources that contain ''nepomuk'' in their label.
* Binding ''score'' always gives the lucene score (between 0 and 1).


These results can now be reused to perform further SPARQL queries.
Of course wildcards are supported, too. However, be aware that when using wildcards the expression itself needs to be enclosed in quotes as follows:


Of course it is possible to use the full range of the [http://lucene.apache.org/java/docs/queryparsersyntax.html Lucene Query Language]. Another simple example would be to only match a certain property:
<syntaxhighlight lang="text">
select ?r where { ?r nao:prefLabel ?label .
                  ?label bif:contains "'nepomuk*'" . }
</syntaxhighlight>


<code cppqt>
For most simple queries (simple queries do not use any back-referencing for example) the [[../NepomukQuery|Nepomuk desktop query API]] should be sufficient.
QString query =
    Soprano::Vocabulary::RDFS::label().toString()
    + ':' + "nepomuk";
Soprano::QueryResultIterator it =
  model->executeQuery( query,
                        Soprano::Query::QueryLanguageUser,
                        "lucene" );
</code>

Latest revision as of 08:16, 24 August 2012

Advanced Sparql Queries in Nepomuk
Tutorial Series   Nepomuk
Previous   Introduction to RDF and Ontologies, Nepomuk Server
What's Next  
Further Reading   SPARQL Quick Reference, SPARQL W3C Definition

Advanced Sparql Queries in Nepomuk

We will now take a look at how to perform queries against the Nepomuk data repository.

Note
The queries presented here are pretty low-level. Only use this approach if the Nepomuk Query API does not fulfill your needs.


The Main Model

Nepomuk uses one main Soprano model which is accessed through the ResourceManager:

Soprano::Model* model = Nepomuk::ResourceManager::instance()->mainModel();

Query Basics

Performing a query with Nepomuk/Soprano always follows the same basic pattern (see the Soprano API documentation for more details on using the iterator):

QString query = getFancyQueryString();
Soprano::QueryResultIterator it
   = model->executeQuery( query,
                          Soprano::Query::QueryLanguageSparql );
while( it.next() ) {
   Soprano::Node value = it.binding( "someVariableName" );
   Soprano::BindingSet allBindings = *it;
}


Simple Queries

Let us have a look at how a query can be constructed. As an example we will query for all resources that are tagged with a certain tag. Let's imagine that we have a reference to this tag stored in myTag. (Please ignore the fact that Nepomuk::Tag::tagOf essentially returns the same information. After all, we are here to learn how it works.)

#include <Soprano/Model>
#include <Soprano/QueryResultIterator>
#include <Soprano/Vocabulary/NAO>

[...]

Nepomuk::Tag myTag = getOurFancyTag();

QString query
   = QString("select distinct ?r where { ?r %1 %2 . }")
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::hasTag()) )
     .arg( Soprano::Node::resourceToN3(myTag.resourceUri()) );

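// Container for the results (the concrete list type is an assumption;
// any container of Nepomuk::Resource works).
QList<Nepomuk::Resource> myResourceList;

Soprano::QueryResultIterator it
   = model->executeQuery( query,
                          Soprano::Query::QueryLanguageSparql );
while( it.next() ) {
   // Each result row binds ?r to the URI of one tagged resource.
   myResourceList << Nepomuk::Resource( it.binding( "r" ).uri() );
}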

We begin by constructing the SPARQL query string. It is a simple query and should be easy to understand if you know SQL: we select the resources that match the patterns in the where clause. In this case, the resource needs to have the hasTag property with myTag as its object. Soprano::Node::resourceToN3 formats each URI in N3 notation, that is, wrapped in angle brackets, so it can be inserted into the query as-is. Soprano already provides a set of standard URIs as static instances in the Soprano::Vocabulary namespace, and since we have the Nepomuk resource object for the tag, we can use its unique URI to look up the tagged resources directly.
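
To make the string handling explicit, here is a minimal sketch of the same query assembly using only the C++ standard library. The functions resourceToN3 and taggedResourcesQuery below are hypothetical stand-ins written for illustration, not the Soprano implementation; real code should keep using Soprano::Node::resourceToN3, which also takes care of URI encoding.

```cpp
#include <string>

// Hypothetical stand-in for Soprano::Node::resourceToN3(): wraps a URI in
// angle brackets so it can be placed into a SPARQL pattern as-is.
// Assumption: the URI is already properly encoded.
std::string resourceToN3(const std::string& uri) {
    return "<" + uri + ">";
}

// Builds the "all resources carrying this tag" query from two URIs.
std::string taggedResourcesQuery(const std::string& hasTagUri,
                                 const std::string& tagUri) {
    return "select distinct ?r where { ?r " + resourceToN3(hasTagUri)
         + " " + resourceToN3(tagUri) + " . }";
}
```

Calling taggedResourcesQuery with the URI of the hasTag property and the URI of a tag yields a query string of the same shape as the one passed to executeQuery above.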

But what if we do not have the tag URI but only its label, i.e. the name given by the user?

Also no problem with SPARQL:

QString myTagLabel = getFancytagLabel();

QString query
   = QString("select distinct ?r where { "
             "?r %1 ?tag . "
             "?tag %2 %3 . }")
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::hasTag()) )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::RDFS::label()) )
     .arg( Soprano::Node(myTagLabel).toN3() );

This already looks a lot more confusing than the previous example, but that is mainly due to the QString argument parameters. Let's clean it up a bit by using SPARQL prefix declarations:

QString query
   = QString("PREFIX nao: %1 "
             "PREFIX rdfs: %2 "
             "PREFIX xls: %3 "
             "select distinct ?r where { "
             "?r nao:hasTag ?tag . "
             "?tag rdfs:label \"%4\"^^xls:string . }")
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::NAO::naoNamespace()) )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::RDFS::rdfsNamespace()) )
     .arg( Soprano::Node::resourceToN3(Soprano::Vocabulary::XMLSchema::xsdNamespace()) )
     .arg( myTagLabel );

Both queries are equivalent, and it is up to the query writer to decide which version to use. We present both here for demonstration purposes.

Now let us analyse what is happening here. Instead of matching a single graph pattern, we match two: the first one introduces another variable, ?tag, which is then reused in the second one. rdfs:label has a string literal range, meaning that each object related to a resource via the rdfs:label property is a string literal. In this case we want to select the tag that has myTagLabel as its label.
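
One practical difference between the two versions is worth spelling out: the prefix version inserts myTagLabel into the query unmodified, so a label containing a double quote would produce a broken query, whereas Soprano::Node(myTagLabel).toN3() produces a properly quoted literal. The following sketch illustrates the kind of escaping involved. The helper names escapeLiteral and stringLiteralToN3 are hypothetical, and the code is a simplified illustration rather than Soprano's actual conversion.

```cpp
#include <string>

// Hypothetical helper: escapes the characters that would otherwise end or
// corrupt a double-quoted SPARQL string literal.
std::string escapeLiteral(const std::string& text) {
    std::string out;
    for (char c : text) {
        switch (c) {
        case '\\': out += "\\\\"; break;  // backslash
        case '"':  out += "\\\""; break;  // double quote
        case '\n': out += "\\n";  break;  // line feed
        case '\r': out += "\\r";  break;  // carriage return
        case '\t': out += "\\t";  break;  // tab
        default:   out += c;
        }
    }
    return out;
}

// Hypothetical helper: formats a plain string as a typed xsd:string literal.
std::string stringLiteralToN3(const std::string& text) {
    return "\"" + escapeLiteral(text)
         + "\"^^<http://www.w3.org/2001/XMLSchema#string>";
}
```

If the prefix version is used with labels entered by the user, some escaping step along these lines is needed before the label is passed to arg().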

Bringing more context into the mix

In Introduction to RDF and Ontologies we briefly learned about named graphs or context which make up the fourth part of each statement in Nepomuk. We can now use this information to filter our results based on creation dates. Imagine for example that we want to retrieve all resources tagged before the first of January 2008. We do this by introducing some more complex SPARQL syntax. For simplicity we go back to our first example of matching the tag URI directly to keep the query from getting too unreadable. But of course both can be combined. (Keep in mind that we only use the prefix syntax here for readability. In actual code it may be better to directly add the URIs from Soprano::Vocabulary to prevent typing errors in property and class names.)

QDateTime firstOfJanuary = getFirstOfJanuary();

QString query
   = QString("PREFIX nao: <%1> "
             "PREFIX rdfs: <%2> "
             "select distinct ?r where { "
             "graph ?g { ?r nao:hasTag <%3> . } "
             "?g nao:created ?time . "
             "FILTER(?time < %4) . }")
     .arg( Soprano::Vocabulary::NAO::naoNamespace().toString() )
     .arg( Soprano::Vocabulary::RDFS::rdfsNamespace().toString() )
     .arg( myTag.resourceUri().toString() )
     .arg( Soprano::Node::literalToN3( firstOfJanuary ) );

This query contains three new concepts:

  1. As we can see, SPARQL does not simply add the context as a fourth parameter; instead, the triples we want to match within a certain context need to be surrounded with the graph keyword.
  2. We use the SPARQL FILTER keyword to keep only those graphs/contexts that have a nao:created value earlier than the first of January.
  3. We do not format the QDateTime ourselves. This is important since QDateTime does not support the RDF way of formatting a dateTime string. Instead, we pass it to Soprano::Node::literalToN3, which converts it to a Soprano::LiteralValue and thereby uses Soprano's internal dateTime string conversion.
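
To illustrate what the third point amounts to, the sketch below formats a date and time as a typed xsd:dateTime literal of the kind that ends up in the FILTER expression. The function dateTimeToN3 is a hypothetical helper written for this example, and it assumes the values are already in UTC; in real code the conversion should be left to Soprano.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats a UTC date and time as an xsd:dateTime
// literal in N3 notation (ISO 8601 timestamp plus datatype URI).
std::string dateTimeToN3(int year, int month, int day,
                         int hour, int minute, int second) {
    char buffer[32];
    // snprintf never writes past the buffer, even for out-of-range input.
    std::snprintf(buffer, sizeof(buffer), "%04d-%02d-%02dT%02d:%02d:%02dZ",
                  year, month, day, hour, minute, second);
    return "\"" + std::string(buffer)
         + "\"^^<http://www.w3.org/2001/XMLSchema#dateTime>";
}
```

For the first of January 2008 at midnight, the helper produces a literal with the timestamp 2008-01-01T00:00:00Z, which is the form a dateTime comparison in a FILTER expects.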


Full text queries

As of KDE 4.4, Nepomuk depends on Virtuoso for data storage. Virtuoso brings a number of useful extensions to SPARQL, most importantly full text search, which is used through the artificial bif:contains property.

This makes it possible to combine graph queries with full text queries:

select ?r where { ?r nao:prefLabel ?label .
                  ?label bif:contains 'nepomuk' . }

The query above will find any resources that contain nepomuk in their label.

Wildcards are supported, too. Be aware, however, that when using wildcards the expression itself needs to be enclosed in an additional pair of quotes:

select ?r where { ?r nao:prefLabel ?label .
                  ?label bif:contains "'nepomuk*'" . }
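
When the search term comes from user input, the nested quoting is easy to get wrong. The following sketch builds the bif:contains expression with single quotes around the term and double quotes around the whole expression. The helpers containsExpression and labelSearchQuery are hypothetical, and dropping quote and backslash characters from the term is a deliberate simplification for this example, not a rule taken from the Virtuoso documentation.

```cpp
#include <string>

// Hypothetical helper: builds the object of a bif:contains pattern.
// The term is wrapped in single quotes and the whole expression in
// double quotes, as in the wildcard example.
std::string containsExpression(const std::string& term) {
    std::string cleaned;
    for (char c : term) {
        // Simplification: drop characters that would break the quoting.
        if (c == '\'' || c == '"' || c == '\\') {
            continue;
        }
        cleaned += c;
    }
    return "\"'" + cleaned + "'\"";
}

// Builds the label search query for an arbitrary search term.
std::string labelSearchQuery(const std::string& term) {
    return "select ?r where { ?r nao:prefLabel ?label . "
           "?label bif:contains " + containsExpression(term) + " . }";
}
```

Passing the term nepomuk* to labelSearchQuery yields the same pattern as the wildcard query shown here, written on a single line.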

For most simple queries (for example, queries that do not use any back-referencing), the Nepomuk desktop query API should be sufficient.