Projects/Nepomuk/DataFeeders: Difference between revisions

    From KDE TechBase
    (→‎Identification: Make it pretty)
    (Removed page from translation)
     
    (11 intermediate revisions by 2 users not shown)
    Line 1: Line 1:
    Some applications need to push large quantities of data into Nepomuk. They are typically called "feeder" applications as they provide Nepomuk with the data it requires. A database is only as powerful as the data it holds.


    While one can use the <code>Resource</code> class to push the data. It'll be slow as the <code>Resource</code> class is synchronous and writes back into the database after each command. What one requires is an asynchronous API to push the application can just write all the data, and then Nepomuk can process and merge all of the data provided with its internal database. 
    == Getting Data into Nepomuk ==


    = SimpleResources =
    Some applications need to push large quantities of data into '''Nepomuk'''. They are typically called ''feeder'' applications as they provide '''Nepomuk''' with the data it requires. A database is only as powerful as the data it holds.


    Applications can use the <code>SimpleResource</code> class to model the data that they want to push.  The <code>SimpleResource</code> class is not connected to the Nepomuk database, and is just a convenience wrapper around a <code>QMultiHash</code>. Any changes made to these SimpleResources are not reflected back to the database, unless explicitly specified.
    While one can use the <tt>Resource</tt> class to push the data. It'll be slow as the <tt>Resource</tt> class is synchronous and writes back into the database after each command. What one requires is an asynchronous API to push the application, just writing all the data, and then '''Nepomuk''' can process and merge all of the data provided with its internal database. 
     
    == SimpleResources ==
     
    Applications can use the <tt>SimpleResource</tt> class to model the data that they want to push.  The <tt>SimpleResource</tt> class is not connected to the '''Nepomuk''' database, and is just a convenience wrapper around a <tt>QMultiHash</tt>. Any changes made to these SimpleResources are not reflected back to the database, unless explicitly specified.


    An example -
    An example -
    Line 28: Line 31:




    In the above example we wish to push data about a song "What If" by the popular english artist "Coldplay". We create a different SimpleResource for each resource that we want to push into Nepomuk, and then add the relevant metadata. These <code>SimpleResource</code>s can reference each other.
    In the above example we wish to push data about a song "What If" by the popular English artist "Coldplay". We create a different SimpleResource for each resource that we want to push into '''Nepomuk''', and then add the relevant metadata. These <tt>SimpleResource</tt>s can reference each other.


    All of this data is currently just stored in memory in a hash table. In order to push the data into Nepomuk, we group it all together using a <code>SimpleResourceGraph</code>. After which was can push the data by calling <code>SimpleResourceGraph::save()</code>.
    All of this data is currently just stored in memory in a hash table. In order to push the data into '''Nepomuk''', we group it all together using a <tt>SimpleResourceGraph</tt>. After which was can push the data by calling <tt>SimpleResourceGraph::save()</tt>.


    Example -
    Example -
    Line 40: Line 43:
    </syntaxhighlight>
    </syntaxhighlight>


    The save operation returns a KJob which has already begun execution. This operation will continue asynchronously, and on completion will emit a [http://api.kde.org/4.8-api/kdelibs-apidocs/kdecore/html/classKJob.html#a67b6c63fc5eb7bd31234960e7a5487d9 signal] on completion.


    The save operation returns a KJob which has already begun execution. This operation will continue asynchronously, and on completion will emit the signal completed.
    The completed signals also return the respective KJob. This job can then be checked for errors, which may have occurred if we tried to save invalid data. It is up to the programmer to make sure that the data is valid. Invalid valid data is completely ignored and an error is given.


    The completed signals also return the respective KJob. This job can then be checked for errors, which may have occurred if we tried to save invalid data. It is up to the programmer to make sure that the data is valid. Invalid valid data is completely ignored and an error is given.
    == StoreResources ==


    = StoreResources =
    Calling the [http://api.kde.org/4.x-api/kdelibs-apidocs/nepomuk-core/html/classNepomuk2_1_1SimpleResourceGraph.html#a9af47ab6961f1f7a5b348c50398fa1dd <tt>SimpleResourceGraph::save</tt>] operational, internally calls the [http://api.kde.org/4.x-api/kdelibs-apidocs/nepomuk-core/html/classNepomuk2_1_1StoreResourcesJob.html <tt>StoreResourcesJob</tt>] with its default parameters. For more specialized use cases [http://api.kde.org/4.x-api/kdelibs-apidocs/nepomuk-core/html/group__nepomuk__datamanagement.html#gac094511c6157fb1f96f3a3344d48128d storeResources] can directly be called.


    Calling the <code>SimpleResourceGraph::save</code> operational, internally calls the <code>StoreResourcesJob</code> with its default parameters. For more specialized use cases storeResources can directly be called.
    The [http://api.kde.org/4.x-api/kdelibs-apidocs/nepomuk-core/html/group__nepomuk__datamanagement.html#gac094511c6157fb1f96f3a3344d48128d storeResources] function is a lengthy procedure that has performs multiple operations on the data after which it pushes the data into Nepomuk. The two main parts of the job are outlined below.


    The storeResources is a lengthy procedure that has performs multiple operations on the data after which it pushes the data into Nepomuk. The two main parts of the job are outlined below.
    === Identification ===


    == Identification ==
    Each SimpleResource contains a uri, which is either an actual uri of the form <tt>nepomuk:/res/some-unique-identifier</tt> or is a blank uri of the form <tt>_:identifier</tt>. The SimpleResources which contain resource uris can just directly be pushed into '''Nepomuk'''. The blank uris require some additional processing.
    Each SimpleResource contains a uri, which is either an actual uri of the form <code>nepomuk:/res/some-unique-identifier</code> or is a blank uri of the form <code>_:identifier</code>. The SimpleResources which contain resource uris can just directly be pushed into Nepomuk. The blank uris require some additional processing.


    Each SimpleResource with a blank uri needs to be translated to a corresponding nepomuk resource uri, if that resource already exists. Otherwise a new resource needs to be created. This translation process is called resource identification. It is performed using the properties specified in the SimpleResource.  
    Each SimpleResource with a blank uri needs to be translated to a corresponding nepomuk resource uri, if that resource already exists. Otherwise a new resource needs to be created. This translation process is called resource identification. It is performed using the properties specified in the SimpleResource.  


    Certain properties in the ontologies are marked as [http://oscaf.sourceforge.net/nrl.html#nrl:DefiningProperty| defining properties]. The criteria is decided as follows -
    Certain properties in the ontologies are marked as [http://oscaf.sourceforge.net/nrl.html#nrl:DefiningProperty defining properties]. The criteria is decided as follows -
    * Properties with a literal range are always defining, unless explicitly marked as a [http://oscaf.sourceforge.net/nrl.html#nrl:NonDefiningProperty|nrl:NonDefiningProperty]
    * Properties with a literal range are always defining, unless explicitly marked as a [http://oscaf.sourceforge.net/nrl.html#nrl:NonDefiningProperty nrl:NonDefiningProperty]
    * Properties with a resource range are always NOT defining, unless explicitly marked with [http://oscaf.sourceforge.net/nrl.html#nrl:DefiningProperty|nrl:DefiningProperty]
    * Properties with a resource range are always NOT defining, unless explicitly marked with [http://oscaf.sourceforge.net/nrl.html#nrl:DefiningProperty nrl:DefiningProperty]


    Two resources are said to match each other if the following criteria are met -
    Two resources are said to match each other if the following criteria are met -
    * Their list of <code>rdf:type</code>s matches.
    * Their list of <tt>rdf:type</tt>s matches.
    * The resources do not have any defining properties which do not match.
    * The resources do not have any defining properties which do not match.
    * At least one defining property matches.
    * At least one defining property matches.


    - Provide an example -
    ==== Example ====
     
    If the following resource already exists in the Nepomuk Repository -
    <syntaxhighlight lang="text">
        <nepomuk:/res/A>
            rdf:type nco:PersonContact ;
            nco:fullname "Peter Parker" ;
            nco:gender nco:male .
    </syntaxhighlight>
     
     
    And then the following data is pushed -
    <syntaxhighlight lang="cpp-qt">
        SimpleResource peter;
        peter.addType( NCO::PersonContact() );
        peter.setProperty( NCO::fullname(), QLatin1String("Peter Parker") );
     
        SimpleResource spiderman;
        spiderman.addType( NCO::PersonContact() );
        spiderman.setProperty( NCO::fullname(), QLatin1String("Spiderman") );
        spiderman.setProperty( NCO::gender(), NCO::male() );
    </syntaxhighlight>
     
    In this case <tt>peter</tt> will be mapped to <tt>nepomuk:/res/A</tt> since it has the same type and all the identifying properties match (nco:fullname). It doesn't matter that nco:gender does not match, as the <tt>peter</tt> doesn't specify a gender. If in a alternative universe <tt>peter</tt> was specified as a <tt>nco:female</tt> in the <tt>SimpleResource</tt> then <tt>peter</tt> would not have been mapped to <tt>nepomuk:/res/A</tt>
     
    <tt>spiderman</tt> does not match any existing contacts, so a new resource with a uri of the form <tt>nepomuk:/res/uuid</tt> is created with the specified properties. That uri can be fetched as follows <tt>simpleResourceJob->mappings( spiderman.uri() )</tt>
     
    === Merging ===


    All blank uris are either mapped to existing resources or new resources are created.
    Once the identification process has been completed, each SimpleResource goes through a series of checks which check if the domain, range and carnality of properties is correct, and then pushes the data into the database after merging the graphs for the statements that already exist, and creating a new graph for the new statements.


    == Merging ==
    == Common Errors ==
    ;1. <Property> has a max cardinality of <value>. Provided <n> values - <list>. Existing - <list>
    :The error indicates that you're not following the cardinality restrictions that are present in the ontologies. For example [http://oscaf.sourceforge.net/nco.html#nco:fullname nco:fullname] has a max cardinality 1. That means that any resource can at max have one full name. You have probably given your SimpleResource Contact two full names.


    Once the identification process has been completed, each SimpleResource goes through a series of checks which check if the domain, range and carnality of properties is correct, and then pushes the data into the database after merging the graphs for the statements that already exist, and creating a new graph for the new statements.  
    ;2. <Property> has rdfs:domain/rdfs:range of <Type>. <Resource> only has the following types
    :If <Resource> is of the form <tt>_:identifier</tt> then it means that your SimpleResource with identifier <Resource> is missing the types given. Otherwise if it is of the form <tt>nepomuk:/res/unique-uuid</tt> that implies that either your SimpleResource was identifier as <Resource> and that resource does not have the respective types, or that you are trying to map it to a resource which does not contain that type.


    = Who else is using it? =
    == Using the data after pushing ==


    The SimpleResource API is currently the de facto method of pushing data into Nepomuk. It is being heavily utilized by our own file indexer, and KDE PIM. PIM uses the SimpleResource api in order to push emails, contacts and event information into Nepomuk.
    In some applications you may need to access the data after you have pushed it into '''Nepomuk''' using <tt>storeResources</tt>. Fortunately there is a convenient way to do that. The SimpleResourceJob provides a function calling mappings, which lets you map the SimpleResource uris to the actual nepomuk uris once they have been saved.
     
    Example -
    <syntaxhighlight lang="cpp-qt">
        using namespace Nepomuk2::Vocabulary;
     
        SimpleResource email;
        email.addType( NCO::EmailAddress() );
        email.addProperty( NCO::emailAddress(), QLatin1String("[email protected]") );
     
        SimpleResource contact;
        contact.addType( NCO::Contact() );
        contact.setProperty( NCO::fullname(), QLatin1String("Peter Parker") );
        contact.addProperty( NCO::hasEmailAddress(), email );
     
        SimpleResourceGraph graph;
        graph << contact << email;
     
        StoreResourcesJob* job = graph.save();
        job->exec();
        QASSERT( !job->error() );
     
        QUrl emailUri = job->mappings().value( email.uri() );
        QUrl contactUri = job->mappings().value( contact.uri() );
    </syntaxhighlight>
     
     
    Here the <tt>email.uri()</tt> function will return a uri of the form <tt>_:identifier</tt>. Same is the case with <tt>contact.uri()</tt>. The <tt>StoreResourcesJob::mappings</tt> returns a <tt>QHash</tt> which maps these blank uris to their respective nepomuk uris. They can then be used as follows -
     
    <syntaxhighlight lang="cpp-qt">
        Nepomuk2::Resource contactRes( contactUri );
        const QString fullname = contactRes.property( NCO::fullname() ).toString();
    </syntaxhighlight>
     
    == Who else is using it? ==
     
    The SimpleResource API is currently the de facto method of pushing data into '''Nepomuk'''. It is being heavily utilized by our own file indexer, and KDE PIM. PIM uses the SimpleResource api in order to push emails, contacts and event information into '''Nepomuk'''.


    For more examples on how to use SimpleResource, we suggest you look at our comprehensive tests present in the datamanagementmodel. Add link!!
    For more examples on how to use SimpleResource, we suggest you look at our comprehensive tests present in the datamanagementmodel. Add link!!


    = Graph Handling =
    == Graph Handling ==


    Most developers do not need to worry about graphs present in Nepomuk. However, for the sake of completion we're documenting what happens internally. Hopefully, this will help you better understand the intricacies on Nepomuk.
    Most developers do not need to worry about graphs present in '''Nepomuk'''. However, for the sake of completion we're documenting what happens internally. Hopefully, this will help you better understand the intricacies on '''Nepomuk'''.


    When a <code>SimpleResourceGraph</code> is saved or passed onto <code>storeResources</code>, each statement in the graph is checked for existance in the database. If that triple already exists, it is set aside and specially handled. All other triples are pushed into this one big graph that is created with each call to <code>storeResources</code>.  
    When a <tt>SimpleResourceGraph</tt> is saved or passed onto <tt>storeResources</tt>, each statement in the graph is checked for existance in the database. If that triple already exists, it is set aside and specially handled. All other triples are pushed into this one big graph that is created with each call to <tt>storeResources</tt>.  


    That graph contains the following data -
    That graph contains the following data -
    Line 92: Line 161:


    When ..
    When ..
    [[Category:Documentation]]
    [[Category:Development]]
    [[Category:Tutorials]]

    Latest revision as of 12:34, 9 February 2018

    Getting Data into Nepomuk

    Some applications need to push large quantities of data into Nepomuk. They are typically called feeder applications as they provide Nepomuk with the data it requires. A database is only as powerful as the data it holds.

    While one can use the Resource class to push the data, it will be slow, as the Resource class is synchronous and writes back into the database after each command. What one requires is an asynchronous API: the application just writes all the data, and Nepomuk then processes and merges all of the data provided with its internal database.

    SimpleResources

    Applications can use the SimpleResource class to model the data that they want to push. The SimpleResource class is not connected to the Nepomuk database, and is just a convenience wrapper around a QMultiHash. Any changes made to these SimpleResources are not reflected back to the database, unless explicitly specified.

    An example -

        // Brings the NCO, NMM, NIE and NFO vocabularies into scope
        using namespace Nepomuk2::Vocabulary;

        // The artist
        Nepomuk2::SimpleResource coldplay;
        coldplay.addType( NCO::Contact() );
        coldplay.addProperty( NCO::fullname(), "Coldplay" );
    
        // The album
        Nepomuk2::SimpleResource album;
        album.addType( NMM::MusicAlbum() );
        album.addProperty( NIE::title(), "X&Y" );
    
        // The song file; fileUrl is the url of the file and is defined elsewhere
        Nepomuk2::SimpleResource fileRes;
        fileRes.addType( NFO::FileDataObject() );
        fileRes.addType( NMM::MusicPiece() );
        fileRes.addProperty( NMM::performer(), coldplay );
        fileRes.addProperty( NMM::musicAlbum(), album );
        fileRes.addProperty( NIE::url(), fileUrl );
        fileRes.addProperty( NIE::title(), "What If" );
    


    In the above example we wish to push data about the song "What If" by the popular English band "Coldplay". We create a separate SimpleResource for each resource that we want to push into Nepomuk, and then add the relevant metadata. These SimpleResources can reference each other.

    So far, all of this data is just stored in memory in a hash table. In order to push the data into Nepomuk, we group it all together using a SimpleResourceGraph, after which we can push the data by calling SimpleResourceGraph::save().

    Example -

        Nepomuk2::SimpleResourceGraph graph;
        graph << coldplay << album << fileRes;
    
        KJob* job = graph.save();
    

    The save operation returns a KJob which has already begun execution. The operation continues asynchronously and emits a signal on completion.

    The completion signal also passes along the respective KJob. This job can then be checked for errors, which may have occurred if we tried to save invalid data. It is up to the programmer to make sure that the data is valid. Invalid data is completely ignored and an error is reported.

    StoreResources

    Calling SimpleResourceGraph::save internally starts a StoreResourcesJob with its default parameters. For more specialized use cases, storeResources can be called directly.

    The storeResources function is a lengthy procedure that performs multiple operations on the data before pushing it into Nepomuk. The two main parts of the job are outlined below.

    Identification

    Each SimpleResource contains a uri, which is either an actual uri of the form nepomuk:/res/some-unique-identifier or is a blank uri of the form _:identifier. The SimpleResources which contain resource uris can just directly be pushed into Nepomuk. The blank uris require some additional processing.

    Each SimpleResource with a blank uri needs to be translated to a corresponding nepomuk resource uri, if that resource already exists. Otherwise a new resource needs to be created. This translation process is called resource identification. It is performed using the properties specified in the SimpleResource.

    Certain properties in the ontologies are marked as defining properties. The criteria are as follows -

    • Properties with a literal range are always defining, unless explicitly marked as an nrl:NonDefiningProperty
    • Properties with a resource range are never defining, unless explicitly marked as an nrl:DefiningProperty
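
    As a rough illustration of the two rules above, the sketch below classifies a property from its range and any explicit marker. It is a plain C++ model, not Nepomuk code: RangeKind, Marker and isDefining are hypothetical names introduced only for this example.

```cpp
// Illustrative model only; these names are not part of the Nepomuk API.
enum class RangeKind { Literal, Resource };

// Explicit annotation in the ontology, if any.
enum class Marker { None, Defining, NonDefining };

// Returns true if a property counts as defining during identification.
bool isDefining(RangeKind range, Marker marker) {
    if (marker == Marker::Defining) return true;
    if (marker == Marker::NonDefining) return false;
    // No explicit marker: literal ranges are defining, resource ranges are not.
    return range == RangeKind::Literal;
}
```

    For instance, a property with a resource range only counts as defining here when it carries the Defining marker.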

    Two resources are said to match each other if the following criteria are met -

    • Their list of rdf:types matches.
    • The resources do not have any defining properties which do not match.
    • At least one defining property matches.

    Example

    If the following resource already exists in the Nepomuk Repository -

        <nepomuk:/res/A>
            rdf:type nco:PersonContact ;
            nco:fullname "Peter Parker" ;
            nco:gender nco:male .
    


    And then the following data is pushed -

        SimpleResource peter;
        peter.addType( NCO::PersonContact() );
        peter.setProperty( NCO::fullname(), QLatin1String("Peter Parker") );
    
        SimpleResource spiderman;
        spiderman.addType( NCO::PersonContact() );
        spiderman.setProperty( NCO::fullname(), QLatin1String("Spiderman") );
        spiderman.setProperty( NCO::gender(), NCO::male() );
    

    In this case peter will be mapped to nepomuk:/res/A, since it has the same type and all of its defining properties (here nco:fullname) match. It doesn't matter that nco:gender is missing, as peter doesn't specify a gender. If, in an alternative universe, peter had been given the gender nco:female in the SimpleResource, then peter would not have been mapped to nepomuk:/res/A.

    spiderman does not match any existing contact, so a new resource with a uri of the form nepomuk:/res/uuid is created with the specified properties. That uri can be fetched from the StoreResourcesJob as follows: job->mappings().value( spiderman.uri() )
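
    The matching criteria and the Peter Parker example can be pictured with a small stand-alone model. This is only a sketch under stated assumptions, not how Nepomuk implements identification: Res and matches are hypothetical names, each defining property is assumed to hold a single value, and nco:gender is treated as defining because the example in the text behaves that way.

```cpp
#include <map>
#include <set>
#include <string>

// Illustrative model only; these types are not part of the Nepomuk API.
struct Res {
    std::set<std::string> types;
    // Defining properties only: property name -> value (single-valued here).
    std::map<std::string, std::string> defining;
};

// Applies the three criteria: same types, no conflicting defining
// property, and at least one defining property in common.
bool matches(const Res& pushed, const Res& existing) {
    if (pushed.types != existing.types) return false;
    bool anyMatch = false;
    for (const auto& entry : pushed.defining) {
        const auto it = existing.defining.find(entry.first);
        if (it == existing.defining.end()) continue;    // absent on the other side: no conflict
        if (it->second != entry.second) return false;   // conflicting value
        anyMatch = true;
    }
    return anyMatch;
}
```

    In this model peter matches the stored resource through nco:fullname, while spiderman is rejected because its nco:fullname conflicts.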

    Merging

    Once the identification process has been completed, each SimpleResource goes through a series of checks which verify that the domain, range and cardinality of its properties are correct. The data is then pushed into the database: the graphs are merged for the statements that already exist, and a new graph is created for the new statements.

    Common Errors

    1. <Property> has a max cardinality of <value>. Provided <n> values - <list>. Existing - <list>
    The error indicates that you're not following the cardinality restrictions present in the ontologies. For example, nco:fullname has a max cardinality of 1, which means that any resource can have at most one full name. You have probably given your SimpleResource contact two full names.
    2. <Property> has rdfs:domain/rdfs:range of <Type>. <Resource> only has the following types
    If <Resource> is of the form _:identifier, your SimpleResource with identifier <Resource> is missing the types given. Otherwise, if it is of the form nepomuk:/res/unique-uuid, either your SimpleResource was identified as <Resource> and that resource does not have the respective types, or you are trying to map it to a resource which does not contain that type.
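
    The first error, the cardinality check, can be pictured as follows. This is a hedged sketch rather than Nepomuk's actual check: exceedsMaxCardinality is a hypothetical helper, and it assumes that provided and existing values are pooled and that identical values are counted once.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Illustrative model only; not Nepomuk's implementation.
// Returns true if the distinct values of one property, taken across the
// provided and the already existing values, exceed the max cardinality.
bool exceedsMaxCardinality(std::size_t maxCardinality,
                           const std::vector<std::string>& provided,
                           const std::vector<std::string>& existing) {
    std::set<std::string> all(existing.begin(), existing.end());
    all.insert(provided.begin(), provided.end());
    return all.size() > maxCardinality;
}
```

    With a max cardinality of 1, as for nco:fullname, pushing "Spiderman" onto a resource that already has the full name "Peter Parker" would trip this check.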

    Using the data after pushing

    In some applications you may need to access the data after you have pushed it into Nepomuk using storeResources. Fortunately there is a convenient way to do that. The StoreResourcesJob provides a function called mappings, which lets you map the SimpleResource uris to the actual Nepomuk uris once they have been saved.

    Example -

        using namespace Nepomuk2::Vocabulary;
    
        SimpleResource email;
        email.addType( NCO::EmailAddress() );
        email.addProperty( NCO::emailAddress(), QLatin1String("[email protected]") );
    
        SimpleResource contact;
        contact.addType( NCO::Contact() );
        contact.setProperty( NCO::fullname(), QLatin1String("Peter Parker") );
        contact.addProperty( NCO::hasEmailAddress(), email );
    
        SimpleResourceGraph graph;
        graph << contact << email;
    
        StoreResourcesJob* job = graph.save();
        job->exec();
        Q_ASSERT( !job->error() );
    
        QUrl emailUri = job->mappings().value( email.uri() );
        QUrl contactUri = job->mappings().value( contact.uri() );
    


    Here email.uri() returns a uri of the form _:identifier, and so does contact.uri(). StoreResourcesJob::mappings returns a QHash which maps these blank uris to their respective Nepomuk uris. They can then be used as follows -

        Nepomuk2::Resource contactRes( contactUri );
        const QString fullname = contactRes.property( NCO::fullname() ).toString();
    

    Who else is using it?

    The SimpleResource API is currently the de facto method of pushing data into Nepomuk. It is heavily used by our own file indexer and by KDE PIM, which uses the SimpleResource API to push emails, contacts and event information into Nepomuk.

    For more examples on how to use SimpleResource, we suggest you look at our comprehensive tests present in the datamanagementmodel. Add link!!

    Graph Handling

    Most developers do not need to worry about the graphs present in Nepomuk. However, for the sake of completeness we're documenting what happens internally. Hopefully, this will help you better understand the intricacies of Nepomuk.

    When a SimpleResourceGraph is saved or passed on to storeResources, each statement in the graph is checked for existence in the database. If that triple already exists, it is set aside and handled specially. All other triples are pushed into one big graph that is created with each call to storeResources.
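
    The split described above can be sketched in plain C++. This is an illustrative model, not the real implementation: Triple, Partition and partitionStatements are hypothetical names, and the database is reduced to a set of triples.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Illustrative model only; not Nepomuk's implementation.
// A statement as (subject, predicate, object).
using Triple = std::tuple<std::string, std::string, std::string>;

struct Partition {
    std::vector<Triple> alreadyStored;  // set aside for special handling
    std::vector<Triple> fresh;          // destined for the newly created graph
};

// Checks each incoming statement for existence in the database.
Partition partitionStatements(const std::vector<Triple>& incoming,
                              const std::set<Triple>& database) {
    Partition result;
    for (const Triple& statement : incoming) {
        if (database.count(statement) > 0) {
            result.alreadyStored.push_back(statement);
        } else {
            result.fresh.push_back(statement);
        }
    }
    return result;
}
```

    Only the fresh statements end up in the new graph that storeResources creates for the call.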

    That graph contains the following data - <nepomuk:/ctx/some-graph> a nrl:Graph . get some data!!

    When ..