Projects/Nepomuk/IndexingPlugin: Difference between revisions

    From KDE TechBase
    (Prepare for translation)
    (Marked this version for translation)
    (3 intermediate revisions by 2 users not shown)
    Line 2: Line 2:
    <translate>
    <translate>


    == Status of Indexing ==
    == Status of Indexing == <!--T:1-->


    <!--T:2-->
    File Indexing has gone through a major overhaul in 4.10. We no longer rely on '''strigi'''. This means that we need to write our own file indexer from scratch. However writing a file indexer is very simple.
    File Indexing has gone through a major overhaul in 4.10. We no longer rely on '''strigi'''. This means that we need to write our own file indexer from scratch. However writing a file indexer is very simple.


    Currently, there is no public interface for the indexing plugins. There might be one for 4.10, but we aren't sure right now.
    <!--T:25-->
    At this time all the indexers live in <tt>nepomuk-core/services/fileindexer/indexer</tt>. They use a private library that is not exported. Therefore public indexers for custom formats are currently not allowed. All indexers must go in the nepomuk-core repository.


    == Extractor Plugin ==
    == Extractor Plugin == <!--T:4-->
    In order to write a file indexer, we have to write a plugin derived from <tt>Nepomuk2::ExtractorPlugin</tt>.  We are required to implement two simple functions -
    In order to write a file indexer, we have to write a plugin derived from <tt>Nepomuk2::ExtractorPlugin</tt>.  We are required to implement two simple functions -


    <!--T:5-->
    <syntaxhighlight lang="cpp-qt">
    <syntaxhighlight lang="cpp-qt">
    class NEPOMUK_EXPORT ExtractorPlugin : public QObject
    class NEPOMUK_EXPORT ExtractorPlugin : public QObject
    Line 19: Line 22:
         virtual ~ExtractorPlugin();
         virtual ~ExtractorPlugin();


         virtual QStringList mimetypes() = 0;
         <!--T:6-->
    virtual QStringList mimetypes() = 0;
         virtual SimpleResourceGraph extract(const QUrl& resUri, const QUrl& fileUrl, const QString& mimeType) = 0;
         virtual SimpleResourceGraph extract(const QUrl& resUri, const QUrl& fileUrl, const QString& mimeType) = 0;
    };
    };
    </syntaxhighlight>
    </syntaxhighlight>


    <!--T:7-->
    These two functions are <tt>mimetypes</tt> and <tt>extract</tt>. Each plugin can act on a certain set of mimetypes. Each plugin simply needs to list out all the mimetypes they support.
    These two functions are <tt>mimetypes</tt> and <tt>extract</tt>. Each plugin can act on a certain set of mimetypes. Each plugin simply needs to list out all the mimetypes they support.


    <!--T:8-->
    The second function <tt>extract</tt> is the heart of the extractor. You are provided with the mimetype and the url of the file. The file can be read and information can be extracted from it.
    The second function <tt>extract</tt> is the heart of the extractor. You are provided with the mimetype and the url of the file. The file can be read and information can be extracted from it.


    === Saving the Extracted Data ===
    === Saving the Extracted Data === <!--T:9-->


    <!--T:10-->
    The Nepomuk Extractors are based around two simple classes <tt>SimpleResource</tt> and <tt>SimpleResourceGraph</tt>. The SimpleResourceGraph is just a collection of <tt>SimpleResource</tt>s. A <tt>SimpleResource</tt> is just a collection of (key, value) pairs which contain the properties of that particular resource.
    The Nepomuk Extractors are based around two simple classes <tt>SimpleResource</tt> and <tt>SimpleResourceGraph</tt>. The SimpleResourceGraph is just a collection of <tt>SimpleResource</tt>s. A <tt>SimpleResource</tt> is just a collection of (key, value) pairs which contain the properties of that particular resource.


    <!--T:11-->
    The main file resource has a resource uri which is passed as a parameter. It can be used as follows -
    The main file resource has a resource uri which is passed as a parameter. It can be used as follows -


    <!--T:12-->
    <syntaxhighlight lang="cpp-qt">
    <syntaxhighlight lang="cpp-qt">
         SimpleResource fileRes( resUri );
         SimpleResource fileRes( resUri );
    Line 44: Line 53:




    <!--T:13-->
    This <tt>fileRes</tt> can then be added to a <tt>SimpleResourceGraph</tt> and returned. It will then be saved in Nepomuk.
    This <tt>fileRes</tt> can then be added to a <tt>SimpleResourceGraph</tt> and returned. It will then be saved in Nepomuk.


    == Required Files ==
    == Required Files == <!--T:14-->


    <!--T:15-->
    Since the plugin interface still isn't public. It would be best to directly contribute to nepomuk-core. The relevant code can be found at nepomuk-core/services/fileindexer/indexer/.
    Since the plugin interface still isn't public. It would be best to directly contribute to nepomuk-core. The relevant code can be found at nepomuk-core/services/fileindexer/indexer/.


    == Testing the Indexer ==
    == Testing the Indexer == <!--T:16-->


    The Indexer is generally automatically called when it detects new files should be indexed. It however can also be forcibly called by running <tt>nepomukindexer --debug fileUrl</tt> on a file.
    <!--T:26-->
    In order to test the indexer, you should call it manually on the specified file by executing <tt>nepomukindexer fileUrl</tt>. If there were no errors, then the file should have been indexed correctly.


    The extra <tt>--debug</tt> is required because normally, the nepomukindexer process only adds details from the plugins. The basic information about the file - It's url, mimetype, etc, and supposed to already exist. The debug option adds that information as well.
    <!--T:27-->
    You can view the indexed data by running <tt>nepomukshow fileUrl</tt>. This 'nepomukshow' tool does not output the plain text content by default. You can print the plain text by calling <tt>nepomukshow --plainText fileUrl</tt>


    === Viewing the indexed information ===
    === Errors === <!--T:21-->
     
    The most commonly used method is the sidebar in Dolphin. However, one can also use this [[Projects/Nepomuk/NepomukShow| nifty tool]] to view the data.
     
    === Errors ===


    <!--T:22-->
    It might be common to get errors that a properties range/domain/cardinality is not being followed. These errors occur when the ontologies are not being properly followed. In that case it would be best to look where you're adding that property and if it actually has the correct domain/range/cardinality.
    It might be common to get errors that a properties range/domain/cardinality is not being followed. These errors occur when the ontologies are not being properly followed. In that case it would be best to look where you're adding that property and if it actually has the correct domain/range/cardinality.


    <!--T:23-->
    The ontologies can be found over here - http://oscaf.sourceforge.net/
    The ontologies can be found over here - http://oscaf.sourceforge.net/


    <!--T:24-->
    [[Category:Documentation]]
    [[Category:Documentation]]
    [[Category:Development]]
    [[Category:Development]]
    [[Category:Tutorials]]
    [[Category:Tutorials]]
    </translate>
    </translate>

    Revision as of 10:12, 30 May 2013


    Status of Indexing

    File indexing went through a major overhaul in 4.10. We no longer rely on strigi, which means that we need to write our own file indexers from scratch. However, writing a file indexer is simple.

    At this time, all the indexers live in nepomuk-core/services/fileindexer/indexer. They use a private library that is not exported, so public indexers for custom formats are currently not possible. All indexers must go in the nepomuk-core repository.

    Extractor Plugin

    To write a file indexer, we write a plugin derived from Nepomuk2::ExtractorPlugin. The plugin must implement two simple functions:

    class NEPOMUK_EXPORT ExtractorPlugin : public QObject
    {
        Q_OBJECT
    public:
        ExtractorPlugin(QObject* parent);
        virtual ~ExtractorPlugin();
    
        virtual QStringList mimetypes() = 0;
        virtual SimpleResourceGraph extract(const QUrl& resUri, const QUrl& fileUrl, const QString& mimeType) = 0;
    };
    

    These two functions are mimetypes and extract. Each plugin acts on a certain set of mimetypes, and the mimetypes function simply needs to return the list of all the mimetypes the plugin supports.
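
    To make the role of the mimetype list concrete, here is a minimal sketch of how an indexer could pick the plugins that apply to a file. It uses only the standard library. ToyExtractor, ToyPlainTextExtractor, ToyImageExtractor, pluginsFor and makeToyPlugins are hypothetical stand-ins invented for this illustration; they are not the Nepomuk2 classes, and the real lookup inside nepomuk-core may work differently.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an extractor plugin (NOT Nepomuk2::ExtractorPlugin).
class ToyExtractor {
public:
    virtual ~ToyExtractor() = default;
    // Every mimetype this plugin is able to handle.
    virtual std::vector<std::string> mimetypes() const = 0;
    // A label used only by this sketch to identify the plugin.
    virtual std::string name() const = 0;
};

class ToyPlainTextExtractor : public ToyExtractor {
public:
    std::vector<std::string> mimetypes() const override { return {"text/plain"}; }
    std::string name() const override { return "plaintext"; }
};

class ToyImageExtractor : public ToyExtractor {
public:
    std::vector<std::string> mimetypes() const override {
        return {"image/png", "image/jpeg"};
    }
    std::string name() const override { return "image"; }
};

// Return the names of all plugins whose mimetype list contains mimeType.
std::vector<std::string> pluginsFor(
    const std::vector<std::unique_ptr<ToyExtractor>>& plugins,
    const std::string& mimeType)
{
    std::vector<std::string> matches;
    for (const auto& plugin : plugins) {
        for (const auto& supported : plugin->mimetypes()) {
            if (supported == mimeType) {
                matches.push_back(plugin->name());
                break; // one match per plugin is enough
            }
        }
    }
    return matches;
}

// Build the small plugin set used in this sketch.
std::vector<std::unique_ptr<ToyExtractor>> makeToyPlugins() {
    std::vector<std::unique_ptr<ToyExtractor>> plugins;
    plugins.push_back(std::unique_ptr<ToyExtractor>(new ToyPlainTextExtractor));
    plugins.push_back(std::unique_ptr<ToyExtractor>(new ToyImageExtractor));
    return plugins;
}
```

    In this toy model, a file whose mimetype no plugin lists is simply not handled by any extractor.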

    The second function, extract, is the heart of the extractor. It is given the mimetype and the URL of the file, so the plugin can read the file and extract information from it.

    Saving the Extracted Data

    The Nepomuk extractors are built around two simple classes: SimpleResource and SimpleResourceGraph. A SimpleResourceGraph is just a collection of SimpleResources, and a SimpleResource is just a collection of (key, value) pairs that hold the properties of that particular resource.
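
    The shape of these two classes can be pictured with a small standard-library model. ToyResource, ToyResourceGraph and buildToyGraph are hypothetical names for this sketch, the "toy:" keys are placeholders rather than real ontology URIs, and the assumption that one key may carry several values is part of the model, not a statement about the real classes.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a resource: a URI plus (key, value) pairs.
// This is NOT the real SimpleResource class.
struct ToyResource {
    std::string uri;
    // Assumption of this sketch: a key may appear more than once.
    std::multimap<std::string, std::string> properties;

    void addProperty(const std::string& key, const std::string& value) {
        properties.emplace(key, value);
    }
};

// Hypothetical model of a graph: just a collection of resources.
struct ToyResourceGraph {
    std::vector<ToyResource> resources;

    void add(const ToyResource& resource) { resources.push_back(resource); }
};

// Build a graph holding a single file resource with three properties.
ToyResourceGraph buildToyGraph() {
    ToyResource file;
    file.uri = "toy:file-1";                  // placeholder URI
    file.addProperty("toy:title", "Notes");   // placeholder keys and values
    file.addProperty("toy:keyword", "kde");
    file.addProperty("toy:keyword", "nepomuk");

    ToyResourceGraph graph;
    graph.add(file);
    return graph;
}
```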

    The main file resource has a resource URI, which is passed to extract as the resUri parameter. It can be used as follows:

        SimpleResource fileRes( resUri );
        fileRes.addType( NFO::PlainTextDocument() );
        fileRes.addProperty( NIE::plainTextContent(), contents );
        fileRes.addProperty( NFO::wordCount(), words );
        fileRes.addProperty( NFO::lineCount(), lines );
        fileRes.addProperty( NFO::characterCount(), characters );
    


    This fileRes can then be added to a SimpleResourceGraph and returned. It will then be saved in Nepomuk.
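
    The snippet above assumes that contents, words, lines and characters have already been computed. One possible way to derive the three counts from the file contents is sketched below with the standard library only. TextStats and computeTextStats are hypothetical names for this sketch, and the counting rules (whitespace-separated words, newline-terminated lines, characters counted as bytes) are assumptions rather than the behaviour of any existing Nepomuk indexer.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical container for the three counts used in the snippet above.
struct TextStats {
    std::size_t words;
    std::size_t lines;
    std::size_t characters;
};

// Count whitespace-separated words, lines, and bytes in contents.
// Note: characters is a byte count, so multi-byte encodings such as
// UTF-8 will report more "characters" than a reader would see.
TextStats computeTextStats(const std::string& contents) {
    TextStats stats{0, 0, contents.size()};

    std::istringstream lineStream(contents);
    std::string line;
    while (std::getline(lineStream, line)) {
        ++stats.lines;

        // Split the current line on whitespace to count its words.
        std::istringstream wordStream(line);
        std::string word;
        while (wordStream >> word) {
            ++stats.words;
        }
    }
    return stats;
}
```

    The resulting values could then be passed to addProperty in place of words, lines and characters.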

    Required Files

    Since the plugin interface still isn't public, it is best to contribute directly to nepomuk-core. The relevant code can be found at nepomuk-core/services/fileindexer/indexer/.

    Testing the Indexer

    To test the indexer, call it manually on a specific file by executing nepomukindexer fileUrl. If no errors are reported, the file should have been indexed correctly.

    You can view the indexed data by running nepomukshow fileUrl. The nepomukshow tool does not output the plain text content by default; to print it, run nepomukshow --plainText fileUrl.

    Errors

    It is common to get errors saying that a property's range, domain, or cardinality is not being followed. These errors occur when the data does not conform to the ontologies. In that case, look at where you are adding that property and check whether it is being used with the correct domain, range, and cardinality.
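
    As an illustration of what a cardinality error means, the toy check below flags every property that has more values than an allowed maximum. It is a sketch under stated assumptions: cardinalityViolations is a hypothetical helper, the "toy:" keys are placeholders rather than real ontology URIs, and the real validation performed by Nepomuk is not shown here.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy check (NOT the Nepomuk validator): return every key that has more
// values in properties than its entry in maxCardinality allows.
// Keys without an entry in maxCardinality are treated as unrestricted.
std::vector<std::string> cardinalityViolations(
    const std::multimap<std::string, std::string>& properties,
    const std::map<std::string, std::size_t>& maxCardinality)
{
    std::vector<std::string> violations;
    for (const auto& rule : maxCardinality) {
        if (properties.count(rule.first) > rule.second) {
            violations.push_back(rule.first);
        }
    }
    return violations;
}
```

    For example, if a placeholder key such as "toy:wordCount" is limited to one value and the extractor adds it twice, the key is reported, which points at the place in the extractor where the property is being added.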

    The ontologies can be found at http://oscaf.sourceforge.net/