Development/Tutorials/Metadata/Nepomuk/DataLayout: Difference between revisions

    From KDE TechBase
    m (Text replace - "<code cppqt>" to "<syntaxhighlight lang="cpp-qt">")
    m (Text replace - "<code>" to "<syntaxhighlight lang="text">")

    Revision as of 20:44, 29 June 2011



    Data Layout in Nepomuk

    Nepomuk is mainly about storing semantic data and meta data about resources on the desktop. This includes files, folders, emails, address book entries, but also tasks, projects, web pages, and much more.

    To really benefit from the data and the Nepomuk system as a developer, it is important to know what is stored in Nepomuk and how it is stored. Only then can one use the data to the fullest and, more importantly, enrich it in ways that others can benefit from as well.

    In the following we give an overview of the layout of the data in Nepomuk and how it can be used. We start with the most basic type of data: file meta data.


    Layout of Automatically Added Data

    File Meta Data extracted by Strigi

    Nepomuk uses the Strigi file analysis system to extract meta data from files and cache it in Nepomuk for powerful desktop search. As described in RDF and Ontologies in Nepomuk, Strigi uses the NIE ontologies to store the data. For each analyzed file a new named graph (or context in Soprano terms) is created. (For a more general discussion of graphs see NRL - The Nepomuk Representation Language.) All extracted meta data about the file is stored in this one graph, which is linked to the file resource itself via the property http://www.strigi.org/fields#indexGraphFor (be aware that the plan is to replace this property with a standard one from NIE). To make this more concrete, let us look at an example:

    <syntaxhighlight lang="text">
    <urn:nepomuk:local:eb343fa9-47ec-4dae-b8d0-fb10c7b63f3d> {
      <file:///home/trueg/nepomuk_kio.diff>
         a nfo:FileDataObject ;
         nie:isPartOf <file:///home/trueg> ;
         nie:contentSize "4152"^^xsd:unsignedInt ;
         nie:plainTextContent "diff [...]" ;
         nie:mimeType "text/x-patch"^^xsd:string ;
         nfo:fileName "nepomuk_kio.diff"^^xsd:string ;
         nie:url <file:///home/trueg/nepomuk_kio.diff> ;
         nfo:characterCount "4032"^^xsd:int ;
         nfo:wordCount "577"^^xsd:int ;
         nfo:lineCount "120"^^xsd:int ;
         nie:lastModified "2008-09-29T15:07:02Z"^^xsd:dateTime .
    }

    <urn:nepomuk:local:eb343fa9-47ec-4dae-b8d0-fb10c7b63f3d-metadata> {
      <urn:nepomuk:local:eb343fa9-47ec-4dae-b8d0-fb10c7b63f3d>
         a nrl:InstanceBase ;
         nao:created "2009-06-03T08:19:18.465Z"^^xsd:dateTime ;
         <http://www.strigi.org/fields#indexGraphFor>
              <file:///home/trueg/nepomuk_kio.diff> .

      <urn:nepomuk:local:eb343fa9-47ec-4dae-b8d0-fb10c7b63f3d-metadata>
         a nrl:GraphMetadata ;
         nrl:coreGraphMetadataFor
             <urn:nepomuk:local:eb343fa9-47ec-4dae-b8d0-fb10c7b63f3d> .
    }
    </syntaxhighlight>

    This is the (slightly trimmed) data that the Nepomuk Strigi service created for the file nepomuk_kio.diff. Two named graphs have been created, as required by NRL: 1. the nrl:InstanceBase graph containing all the meta data about the file and 2. the nrl:GraphMetadata graph containing meta data about the first graph. The first one is rather self-explanatory: it simply contains the key/value pairs extracted by Strigi, such as the mime type or the size of the file. The second graph, however, is interesting from a data layout point of view. It contains two important pieces of information: 1. the date at which the data was extracted by Strigi (the nao:created date of the graph) and 2. the URI of the first graph. The latter makes it very easy to update the file meta data: one simply removes the complete graph and lets Strigi re-create it. All additional meta data such as tags or other links remains unchanged, as it is not stored in the same graph (see below for details).

    Thus, if we were interested in getting only the meta data extracted by Strigi and not any manually added data we could use a query like the following:

    <syntaxhighlight lang="text">
    select ?p ?o where {
        graph ?g { <file:///home/trueg/nepomuk_kio.diff> ?p ?o . } .
        ?g <http://www.strigi.org/fields#indexGraphFor>
            <file:///home/trueg/nepomuk_kio.diff> .
    }
    </syntaxhighlight>

    Or, if we wanted the exact opposite, we could ask for all data except the data extracted by Strigi (which may be interesting for a backup, since Strigi data can be recreated from the file):

    <syntaxhighlight lang="text">
    select ?p ?o where {
        graph ?g { <file:///home/trueg/nepomuk_kio.diff> ?p ?o . } .
        ?ig <http://www.strigi.org/fields#indexGraphFor>
            <file:///home/trueg/nepomuk_kio.diff> .
        FILTER(?g != ?ig) .
    }
    </syntaxhighlight>
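To make the effect of the two queries more tangible, here is a minimal sketch in plain C++ that does not use Soprano or any Nepomuk API. The `Quad` struct and the functions `strigiGraphsFor` and `statementsAbout` are hypothetical names invented for this illustration, and the in-memory vector merely stands in for the Nepomuk store. It mirrors the intent of the queries rather than exact SPARQL semantics: for instance, it also returns the manually added statements when no index graph exists at all.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one statement stored in a named graph.
struct Quad {
    std::string subject, predicate, object, graph;
};

const std::string kIndexGraphFor = "http://www.strigi.org/fields#indexGraphFor";

// Collect all graphs that are linked to the file via indexGraphFor.
std::set<std::string> strigiGraphsFor(const std::vector<Quad>& store,
                                      const std::string& file) {
    std::set<std::string> graphs;
    for (const Quad& q : store)
        if (q.predicate == kIndexGraphFor && q.object == file)
            graphs.insert(q.subject);
    return graphs;
}

// Return the statements about the file. With fromStrigi == true only the
// statements stored in a Strigi index graph are returned, otherwise only
// the statements stored in any other graph.
std::vector<Quad> statementsAbout(const std::vector<Quad>& store,
                                  const std::string& file, bool fromStrigi) {
    const std::set<std::string> indexGraphs = strigiGraphsFor(store, file);
    std::vector<Quad> result;
    for (const Quad& q : store) {
        if (q.subject != file)
            continue;
        const bool inIndexGraph = indexGraphs.count(q.graph) > 0;
        if (inIndexGraph == fromStrigi)
            result.push_back(q);
    }
    return result;
}
```

Calling `statementsAbout(store, file, true)` corresponds to the first query and `statementsAbout(store, file, false)` to the second one.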

    Now that we have seen how the file meta data from Strigi is stored, we will have a look at the data created by simple annotation tools such as the Dolphin tagging and rating feature.


    Tagging, Rating, and so on via Nepomuk::Resource

    The typical way to create manual annotations via Nepomuk is to use Nepomuk::Resource as described in Handling Resources with Nepomuk. One either uses setProperty and addProperty or relies on the Nepomuk resource generator to generate convenience methods.

    In any case, the created data has the same layout, which should be familiar by now. Let us look at another example, this time of tagging a file:

    <syntaxhighlight lang="text">
    <urn:nepomuk:local:9a70ed7b-1fc7-4680-9c65-efcf89010352> {
      <file:///home/trueg/nepomuk_kio.diff>
         nao:lastModified "2009-08-21T09:14:37.702Z"^^xsd:dateTime ;
         nao:hasTag <nepomuk:/Nepomuk> .

      <nepomuk:/Nepomuk>
         a nao:Tag ;
         nao:lastModified "2009-08-21T09:14:37.638Z"^^xsd:dateTime ;
         nao:created "2009-08-21T09:14:37.54Z"^^xsd:dateTime ;
         nao:identifier "Nepomuk"^^xsd:string ;
         nao:prefLabel "Nepomuk"^^xsd:string .
    }

    <urn:nepomuk:local:809c139c-07a3-4ef1-b2c7-4d4428708c36> {
      <urn:nepomuk:local:9a70ed7b-1fc7-4680-9c65-efcf89010352>
         a nrl:InstanceBase ;
         nao:created "2009-08-21T09:14:37.574Z"^^xsd:dateTime .

      <urn:nepomuk:local:809c139c-07a3-4ef1-b2c7-4d4428708c36>
         a nrl:GraphMetadata ;
         nrl:coreGraphMetadataFor
             <urn:nepomuk:local:9a70ed7b-1fc7-4680-9c65-efcf89010352> .
    }
    </syntaxhighlight>

    Again two graphs have been created: the first one contains the actual data and the second one contains meta data about the first one (most importantly the nao:created date). In the case of Nepomuk::Resource the first graph is used to group data created in the same transaction, so its main purpose is to record the creation date. As we can see, a tag is an actual resource and not only a string. Another thing worth noticing, and maybe a bit confusing, is the nao:lastModified date of the file resource: it has been changed although the modification date of the file itself did not change. Here we must not confuse nie:lastModified and nao:lastModified. The former caches the actual file modification date in the file system, while the latter stores the last modification date of the RDF resource in the Nepomuk store. The latter is set automatically by Nepomuk::Resource.

    Having the creation date of each graph allows us to perform statistical queries on annotation usage. We could for example check which were the last 5 annotations performed:

    <syntaxhighlight lang="text">
    select ?p ?o where {
        graph ?g { ?r ?p ?o . } .
        ?g nao:created ?date .
    } ORDER BY DESC(?date) LIMIT 5
    </syntaxhighlight>

    (We could again exclude Strigi extracted data via a filter as above.)

    More complex queries could, for example, determine the usage frequency of tags.

    Let us have another look at the nao:Tag. It also has a nao:lastModified date and a nao:created date, both set automatically by Nepomuk::Resource. The nao:prefLabel is set by Resource::setLabel (this should always be done). The nao:identifier is set automatically if the Nepomuk::Resource is created via the constructor taking a QString argument rather than a QUrl. This makes it possible to later re-create a Resource instance for the same tag without needing the resource URI: Nepomuk::Resource will match the nao:identifier to the string passed to the constructor.

    A comment on the URI of the tag: it is theoretically random. The use of the label is only intended to make the URI more human-readable. nepomuk:/34264345234124234523 would work as well.
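The identifier lookup described above can be pictured with the following minimal sketch in plain C++. It is not the Nepomuk::Resource implementation: the `TagRegistry` class and its `uriFor` method are hypothetical names, and the map simply stands in for the nao:identifier statements in the store. The sketch only assumes what the text states, namely that an identifier is matched first and that a readable URI is generated from the label otherwise.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the identifier lookup; not the Nepomuk API.
class TagRegistry {
public:
    // Return the URI of the resource with the given identifier. If no such
    // resource exists yet, create one with a human-readable URI.
    std::string uriFor(const std::string& identifier) {
        const auto it = uriByIdentifier_.find(identifier);
        if (it != uriByIdentifier_.end())
            return it->second;  // same identifier, same resource
        // Any unique URI would do; the label only keeps it readable.
        const std::string uri = "nepomuk:/" + identifier;
        uriByIdentifier_.emplace(identifier, uri);
        return uri;
    }

    // Number of resources created so far.
    std::size_t size() const { return uriByIdentifier_.size(); }

private:
    std::map<std::string, std::string> uriByIdentifier_;
};
```

Asking twice for the identifier "Nepomuk" yields the same URI both times and creates only one resource.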

    Now that we have seen how Nepomuk::Resource stores data, the remaining sections look at how the ontologies themselves are stored and at manually adding information to the Nepomuk database.


    Storing Ontology Data

    Apart from actual data and meta data, Nepomuk also stores the ontologies themselves in the database. The Nepomuk Ontology Loader Service takes care of that. As above, each ontology is stored in two graphs: the main nrl:Ontology graph containing all classes and properties, and the nrl:GraphMetadata graph containing meta data about the ontology, such as the last modification date or the authors:

    <syntaxhighlight lang="text">
    <http://www.semanticdesktop.org/ontologies/2007/08/15/nao> {
      nao:hasDefaultNamespaceAbbreviation
           a       rdf:Property ;
           rdfs:comment "Defines the default static namespace abbreviation for a graph" ;
           rdfs:domain nrl:Data ;
           rdfs:label "has default namespace abbreviation" ;
           rdfs:range rdfs:Literal ;
           rdfs:subPropertyOf nao:annotation ;
           nrl:maxCardinality "1"^^xsd:nonNegativeInteger .

      nao:Symbol
           a       rdfs:Class ;
           rdfs:comment "Represents a symbol" ;
           rdfs:label "symbol" ;
           rdfs:subClassOf rdfs:Resource .

      [...]
    }

    <http://www.semanticdesktop.org/ontologies/2007/08/15/nao/metadata> {
      <http://www.semanticdesktop.org/ontologies/2007/08/15/nao/metadata>
           a       nrl:GraphMetadata ;
           nrl:coreGraphMetadataFor
               <http://www.semanticdesktop.org/ontologies/2007/08/15/nao> .

      <http://www.semanticdesktop.org/ontologies/2007/08/15/nao>
           a       nrl:Ontology , nrl:DocumentGraph ;
           nao:hasDefaultNamespace
               "http://www.semanticdesktop.org/ontologies/2007/08/15/nao#" ;
           nao:hasDefaultNamespaceAbbreviation "nao" ;
           nao:lastModified "2009-07-20T14:59:09.500Z" ;
           nao:serializationLanguage "TriG" ;
           nao:status "Unstable" ;
           nrl:updatable "0" ;
           nao:version "3" .
    }
    </syntaxhighlight>

    This is the information used by the classes in the Nepomuk::Types namespace. It also allows the Nepomuk Query Service to match field names used in query strings to actual properties. This is done with a SPARQL query like the following:

    <syntaxhighlight lang="text">
    select distinct ?p where {
        ?p a rdf:Property .
        ?p rdfs:label ?label .
        FILTER(REGEX(STR(?label),'tag','i')) .
    }
    </syntaxhighlight>

    This will match a query string like tag:nepomuk to a property. It will actually match nao:hasTag which has rdfs:range set to nao:Tag. Thus, the query engine will then look for any nao:Tag instances with a label that matches nepomuk.
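The matching step itself can be sketched in a few lines of plain C++. This is only an illustration of the case-insensitive regular expression filter used in the query above, not the Query Service code: the function name `matchProperties` and the map from property names to labels are assumptions made for this example.

```cpp
#include <map>
#include <regex>
#include <string>
#include <vector>

// Return all properties whose rdfs:label matches the field name used in
// the query string, ignoring case, like REGEX(STR(?label), field, 'i').
// labelByProperty is a hypothetical stand-in for the stored ontology data.
std::vector<std::string> matchProperties(
        const std::map<std::string, std::string>& labelByProperty,
        const std::string& field) {
    const std::regex pattern(field,
                             std::regex::ECMAScript | std::regex::icase);
    std::vector<std::string> matches;
    for (const auto& entry : labelByProperty)
        if (std::regex_search(entry.second, pattern))
            matches.push_back(entry.first);
    return matches;
}
```

With an example label such as "has tag" for nao:hasTag, the field name tag (in any capitalization) selects that property, after which the engine can continue with the property's rdfs:range as described above.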


    Manually Adding Data

    Apart from using Nepomuk::Resource it is perfectly possible to add data manually through the very low level methods provided by Soprano like Soprano::Model::addStatement. This can be done using the Soprano::Model provided by Nepomuk::ResourceManager::mainModel.

    When adding data manually, one should make sure the data follows the same layout described above: one nrl:InstanceBase for the actual data and one nrl:GraphMetadata for the nao:created date. One way to handle this is through Soprano::NrlModel, but be aware that the API is not stable yet. It might make sense to copy the class into one's own codebase:

    <syntaxhighlight lang="cpp-qt">
    Soprano::NrlModel model(Nepomuk::ResourceManager::instance()->mainModel());
    QUrl g = model.createGraph(Soprano::Vocabulary::NRL::InstanceBase());
    model.addStatement(...., g);
    model.addStatement(...., g);
    model.addStatement(...., g);
    [...]
    </syntaxhighlight>

    This will create a new nrl:InstanceBase graph including its nrl:GraphMetadata which stores the nao:created date. The graph URI can then be used to add new statements.
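For readers who want to see which statements such a graph pair consists of, here is a minimal sketch in plain C++ without Soprano. The class `TinyNrlStore`, the `Statement` struct, and the `urn:example:` URIs are hypothetical; the sketch only reproduces the layout shown in the examples above and makes no claim about how Soprano::NrlModel is implemented. The creation date is passed in as a string to keep the example deterministic.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one statement stored in a named graph.
struct Statement {
    std::string subject, predicate, object, graph;
};

class TinyNrlStore {
public:
    // Create a data graph of the given type together with its
    // nrl:GraphMetadata graph and return the URI of the data graph.
    std::string createGraph(const std::string& type,
                            const std::string& created) {
        const std::string graph =
            "urn:example:graph-" + std::to_string(++counter_);
        const std::string meta = graph + "-metadata";
        // All statements about the data graph live in the metadata graph.
        statements_.push_back({graph, "rdf:type", type, meta});
        statements_.push_back({graph, "nao:created", created, meta});
        statements_.push_back({meta, "rdf:type", "nrl:GraphMetadata", meta});
        statements_.push_back({meta, "nrl:coreGraphMetadataFor", graph, meta});
        return graph;
    }

    // Add an ordinary statement to an already created data graph.
    void addStatement(const std::string& subject, const std::string& predicate,
                      const std::string& object, const std::string& graph) {
        statements_.push_back({subject, predicate, object, graph});
    }

    const std::vector<Statement>& statements() const { return statements_; }

private:
    std::vector<Statement> statements_;
    int counter_ = 0;
};
```

One call to `createGraph` thus produces four statements in the metadata graph, and every later `addStatement` call with the returned URI adds to the data graph only.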


    One Step Further: PIMO

    For KDE 4.4 the plan is to introduce PIMO - The Personal Information Model Ontology, or at least parts of it. The basic idea here is to use PIMO classes to represent real-world entities like persons, projects, countries, and so on and then link those to desktop resources like files, address book entries, and so on. Two typical examples make this clearer.

    The first is a single person resource that represents the real-world person and provides a sort of meta-contact for all address book entries, email addresses, and instant messaging accounts referring to that person. All annotations would be made on the person resource rather than on its occurrences. This ensures that annotations made in the address book also show up in the instant messenger. This is something targeted by the Telepathy project in KDE.

    The second is an invoice that is stored in an ODF file but also has an exported PDF version and has been attached to an email. All three representations (occurrences in PIMO terminology) can be grouped under one PIMO resource of type Invoice. Again, annotations are made on the Invoice resource rather than on the files. One can think of the PIMO resource as the content or the spirit of the document.

    To quickly demonstrate the technical side of it, consider the following example.

    <syntaxhighlight lang="text">
      <nepomuk:/hsklueslfdh>
         a pimo:Thing ;
         pimo:groundingOccurrence <file:///home/trueg/nepomuk_kio.diff> ;
         nao:lastModified "2009-08-21T09:14:37.702Z"^^xsd:dateTime ;
         nao:hasTag <nepomuk:/Nepomuk> .
    </syntaxhighlight>

    Here we have a pimo:Thing (the base class for all PIMO resources) which has the already familiar diff file as its pimo:groundingOccurrence. Instead of tagging the file resource itself, we tagged the pimo:Thing.

    TODO: it is not settled yet if this really brings a benefit for files or if files should be the exception to the rule, meaning they should always be annotated directly.