Difference between revisions of "Development/Tutorials/Writing file analyzers"

Latest revision as of 11:20, 30 June 2011


Writing KDE4 file analyzers

File analyzers extract data from files to display in the file dialogs and file managers. The data gathered this way is also used to search for files. KDE4 allows the use of multiple analyzers per file type. Analyzers can extract text which is used for indexing, but they can also retrieve other data such as song title, album title, recipient, md5 sum, the mimetype of a file, and much more.

This tutorial describes how you can write new analyzers.

Primer

What are file analyzers?

A file analyzer is a class that extracts metadata from a file or data stream. Some file analyzers are specific to certain file types, such as an analyzer that extracts information from an Ogg Vorbis file. Others are more general and calculate, for example, the MD5 or SHA-1 hash of a file.

File analyzers in KDE4

KDE4 uses stream based file analyzers for retrieving text and metadata from files. This has a number of advantages over file based methods. Stream based access

  • is faster for 90% of the file types,
  • allows easy analysis of embedded files such as email attachments or entries from zip files, rpms and many other container file formats.

Writing stream-based analyzers requires a different approach than the usual file-based methods, and this tutorial explains how to go about it.

The current state of porting the KDE3 kfile plugins to KDE4 stream analyzers can be seen at http://wiki.kde.org/tiki-index.php?page=Porting+KFilePlugin+Progress.

Look for existing code

If you want to see some code examples, take a look at the already implemented file analyzers at /kdesupport/strigi/src/streamanalyzer/.

Some examples of metadata extraction from files can also be found in the Hachoir project, in its online parser source code.

Choosing the type of analyzer

There are two main types of analyzers: StreamThroughAnalyzer and StreamEndAnalyzer. The latter is more powerful and a bit easier to program, but has a limitation: only one StreamEndAnalyzer can analyze a particular resource, while you can use as many StreamThroughAnalyzers as you like. Most analyzers can be written as StreamThroughAnalyzers. The most important exception is for analyzers that extract embedded resources from a stream. Examples of this are the ZipEndAnalyzer, the MailEndAnalyzer and the RpmEndAnalyzer.
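The selection rule described above can be sketched in a few lines. This is only an illustration under our own assumptions: the types ThroughAnalyzer and EndAnalyzer and the function selectAnalyzers() are hypothetical stand-ins, not the Strigi classes, and the real indexer may choose analyzers differently.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the Strigi classes.
struct ThroughAnalyzer {
    std::string name;
};

struct EndAnalyzer {
    std::string name;
    std::string signature;  // leading bytes this analyzer recognizes

    // Cheap test on the first bytes of the resource.
    bool checkHeader(const char* header, int32_t headersize) const {
        const int32_t n = static_cast<int32_t>(signature.size());
        return headersize >= n && signature.compare(0, n, header, n) == 0;
    }
};

// Returns the names of the analyzers that would handle the resource:
// every through analyzer, followed by at most one end analyzer.
std::vector<std::string> selectAnalyzers(
        const std::vector<ThroughAnalyzer>& through,
        const std::vector<EndAnalyzer>& end,
        const char* header, int32_t headersize) {
    std::vector<std::string> chosen;
    for (const ThroughAnalyzer& t : through) {
        chosen.push_back(t.name);  // through analyzers never exclude each other
    }
    for (const EndAnalyzer& e : end) {
        if (e.checkHeader(header, headersize)) {
            chosen.push_back(e.name);
            break;  // only one end analyzer per resource
        }
    }
    return chosen;
}
```

With two through analyzers and several end analyzers registered, a resource starting with "BM" would be handled by both through analyzers and by the first end analyzer whose signature matches.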

In this tutorial we focus on a simple example file type: BMP images. The information we will get from this file is located at the start of the file. It turns out that in this case, it is just as easy to implement the analyzer as a StreamEndAnalyzer as a StreamThroughAnalyzer. We will implement it as a StreamEndAnalyzer and point out how to do the same as a StreamThroughAnalyzer.

Two other types of stream analyzers have been added to Strigi: StreamLineAnalyzer, for file formats based on lines of plain text, and StreamSaxAnalyzer, for XML-based files such as SVG.

StreamEndAnalyzer

Three functions need to be implemented in a StreamEndAnalyzer:

  • name()
  • checkHeader()
  • analyze()

Here is what a class used to process BMP images might look like:

#include <strigi/streamendanalyzer.h>
 
class BmpEndAnalyzerFactory;
 
class BmpEndAnalyzer : public Strigi::StreamEndAnalyzer 
{
public:
    BmpEndAnalyzer(const BmpEndAnalyzerFactory* f) :factory(f) {}
    const char* name() const { return "BmpEndAnalyzer"; }
    bool checkHeader(const char* header, int32_t headersize) const;
    char analyze(Strigi::AnalysisResult& idx,
                 Strigi::InputStream* in);
private:
    const BmpEndAnalyzerFactory* factory;
};

The factory object is used to load the analyzer at runtime and will be covered in detail later in this tutorial.

name() returns a unique name used internally to identify the analyzer. As such, the name does not need to be translated or suitable for display in a user interface.

Since only one StreamEndAnalyzer can be used per resource, it is important to quickly select the right one. For this, we do not rely on the mimetype, but on the actual contents of the resource. As a first sifting, the initial bytes of a file are checked. This is usually just as fast as comparing a mimetype identifier, but has the advantage of being more direct and thus often more accurate.

It is by no means necessary for checkHeader() to be 100% correct. The most important thing is that it quickly determines when an analyzer cannot handle a resource. Should it accidentally return true for a resource the analyzer cannot handle, the analyzer can still detect this in analyze() and return an error.

#include <cstring>  // strncmp

bool BmpEndAnalyzer::checkHeader(const char* header,
                                 int32_t headersize) const
{
    // we need at least the two signature bytes
    if (headersize < 2) {
        return false;
    }
    // signatures used by Windows and OS/2 bitmap files
    return strncmp(header, "BM", 2) == 0
        || strncmp(header, "BA", 2) == 0
        || strncmp(header, "CI", 2) == 0
        || strncmp(header, "CP", 2) == 0
        || strncmp(header, "IC", 2) == 0
        || strncmp(header, "PT", 2) == 0;
}

A BMP file can start with one of six two-byte signatures. If the header matches any of them, we return true.
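To try the signature test outside of Strigi, the same check can be written as a free function. This is a minimal sketch: the name looksLikeBmp() is our own, and the table-driven loop is just one way to express the six comparisons.

```cpp
#include <cstdint>
#include <cstring>

// Free-function version of the header check, for illustration only.
// Returns true if the first two bytes match one of the six signatures.
bool looksLikeBmp(const char* header, int32_t headersize) {
    static const char* const signatures[] = {"BM", "BA", "CI", "CP", "IC", "PT"};
    if (header == nullptr || headersize < 2) {
        return false;  // not enough data to decide, so reject
    }
    for (const char* sig : signatures) {
        if (std::memcmp(header, sig, 2) == 0) {
            return true;
        }
    }
    return false;
}
```

Note that a text file starting with "BM" also passes this test. That is acceptable for a first sifting, as long as the later analysis rejects such input.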

If a resource passes this test, the analyze() function will be called. The real work occurs in this function. In this tutorial, we will not do a complete analysis; we only look at the way the pixels are stored in the BMP file. This information is stored in bytes 30-33, which encode a 32-bit little-endian number.

using namespace Strigi;
char BmpEndAnalyzer::analyze(AnalysisResult& idx, InputStream* in)
{
    // read compression type (bytes #30-33)
    const char* h;
    int32_t n = in->read(h, 34, 34); // read exactly 34 bytes
    in->reset(0);   // rewind to the start of the stream
    if (n < 34) {
        return Error;
    }
 
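    // BMP integers are little-endian: h[30] is the least significant byte.
    // Cast to uint32_t before shifting to avoid signed overflow.
    uint32_t bmpi_compression = (uint32_t)(unsigned char)h[30]
                              | ((uint32_t)(unsigned char)h[31] << 8)
                              | ((uint32_t)(unsigned char)h[32] << 16)
                              | ((uint32_t)(unsigned char)h[33] << 24);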
 
    switch (bmpi_compression) {
    case 0 :
        idx.addValue(factory->compressionField, "None");
        break;
    case 1 :
        idx.addValue(factory->compressionField, "RLE 8bit/pixel");
        break;
    case 2 :
        idx.addValue(factory->compressionField, "RLE 4bit/pixel");
        break;
    case 3 :
        idx.addValue(factory->compressionField, "Bitfields");
        break;
    default :
        idx.addValue(factory->compressionField, "Unknown");
    }
    return Ok;
}

First we read exactly 34 bytes from the stream. We do not need to allocate a buffer; this is handled by the stream. The stream returns a pointer to its internal buffer, which avoids data copying and buffer allocation. If the resource has fewer than 34 bytes or if an error occurred during reading, we return with an error code.

Bytes 30-33 contain the information we need, and we write this into the AnalysisResult object. This object collects all metadata and passes it to the code that initiated the analysis. The data may get written into an index or get passed to, for example, a file dialog. To indicate the type of metadata, we pass a pointer to a registered field.
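The decoding step can be tested on its own, without a stream. The sketch below assumes a plain byte buffer; the helper names readLittleEndianUInt32() and bmpCompressionName() are ours, chosen for illustration. BMP files store integers in little-endian order, so byte 30 is the least significant byte of the compression field.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Reads a 32-bit little-endian integer starting at p.
uint32_t readLittleEndianUInt32(const unsigned char* p) {
    return static_cast<uint32_t>(p[0])
         | (static_cast<uint32_t>(p[1]) << 8)
         | (static_cast<uint32_t>(p[2]) << 16)
         | (static_cast<uint32_t>(p[3]) << 24);
}

// Maps the compression field (bytes 30-33) of a BMP header to a label.
// Returns an empty string if fewer than 34 bytes are available.
std::string bmpCompressionName(const unsigned char* data, std::size_t size) {
    if (data == nullptr || size < 34) {
        return std::string();
    }
    switch (readLittleEndianUInt32(data + 30)) {
    case 0:  return "None";
    case 1:  return "RLE 8bit/pixel";
    case 2:  return "RLE 4bit/pixel";
    case 3:  return "Bitfields";
    default: return "Unknown";
    }
}
```

Keeping the decoding in a function that only sees bytes makes it easy to feed it hand-built headers in a unit test, independent of how the bytes were read.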

StreamThroughAnalyzer

We will now look at how to implement a StreamThroughAnalyzer. For this, we need to implement three functions: setIndexable(), connectInputStream() and isReadyWithStream().

#include <strigi/streamthroughanalyzer.h>

class BmpThroughAnalyzerFactory;

class BmpThroughAnalyzer : public Strigi::StreamThroughAnalyzer {
private:
    Strigi::AnalysisResult* analysisResult;  // set by setIndexable()
    const BmpThroughAnalyzerFactory* factory;
public:
    BmpThroughAnalyzer(const BmpThroughAnalyzerFactory* f)
        : analysisResult(0), factory(f) {}
    ~BmpThroughAnalyzer() {}
    void setIndexable(Strigi::AnalysisResult* i) { analysisResult = i; }
    Strigi::InputStream *connectInputStream(Strigi::InputStream *in);
    bool isReadyWithStream() { return true; }
};

For simple file formats where all information is in the initial part of the file, all of the work gets done in connectInputStream(). For more complicated cases look at other examples, such as DigestThroughAnalyzer. In connectInputStream(), we perform the same analysis as we did in the StreamEndAnalyzer:

Strigi::InputStream*
BmpThroughAnalyzer::connectInputStream(Strigi::InputStream* in) {
    // read compression type (bytes #30-33)
    const char* h;
    int32_t n = in->read(h, 34, 34); // read exactly 34 bytes
    in->reset(0);   // rewind to the start of the stream
    if (n < 34) return in;
 
    // BMP integers are little-endian: byte 30 is the least significant
    uint32_t bmpi_compression = (uint32_t)(unsigned char)h[30]
         + ((uint32_t)(unsigned char)h[31]<<8)
         + ((uint32_t)(unsigned char)h[32]<<16)
         + ((uint32_t)(unsigned char)h[33]<<24);
 
    switch (bmpi_compression) {
    case 0 :
        analysisResult->addValue(factory->compressionField, "None");
        break;
    case 1 :
        analysisResult->addValue(factory->compressionField, "RLE 8bit/pixel");
        break;
    case 2 :
        analysisResult->addValue(factory->compressionField, "RLE 4bit/pixel");
        break;
    case 3 :
        analysisResult->addValue(factory->compressionField, "Bitfields");
        break;
    default :
        analysisResult->addValue(factory->compressionField, "Unknown");
    }
    return in;
}

The difference from StreamEndAnalyzer is that the stream we return must have its current position at the start of the stream. We can ensure this by calling reset(0) after each call to read(). This will not affect the pointer to the character data that was set in the call to read().

In addition, we do not return a status code but an InputStream. In this case it is the same stream we were given. For more complicated cases, we can subclass InputStream and analyze the data that passes through it. This is where the StreamThroughAnalyzer gets its name.

Since we have finished with the stream after the call to connectInputStream, we implement isReadyWithStream() to return true. This function is used by StreamIndexer to stop reading as soon as possible to speed up the analysis.
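
To make the "through" idea concrete, here is a minimal sketch that deliberately does not use the Strigi API. ByteSource, MemorySource and CountingSource are hypothetical stand-ins for InputStream and its subclasses; the real InputStream interface differs (for example, it hands out a pointer to its internal buffer instead of copying). The wrapper forwards every read to the underlying source and inspects the bytes on their way to the consumer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for an input stream; NOT Strigi::InputStream.
class ByteSource {
public:
    virtual ~ByteSource() {}
    // Copy up to max bytes into buf; return the number copied (0 at end).
    virtual std::size_t read(char* buf, std::size_t max) = 0;
};

// Source backed by an in-memory string, used only for demonstration.
class MemorySource : public ByteSource {
    std::string data;
    std::size_t pos;
public:
    explicit MemorySource(const std::string& d) : data(d), pos(0) {}
    std::size_t read(char* buf, std::size_t max) {
        std::size_t n = std::min(max, data.size() - pos);
        std::memcpy(buf, data.data() + pos, n);
        pos += n;
        return n;
    }
};

// Wrapper that analyzes data as it passes through: it counts the bytes
// and newline characters that the consumer reads.
class CountingSource : public ByteSource {
    ByteSource* inner;
    std::size_t bytes;
    std::size_t lines;
public:
    explicit CountingSource(ByteSource* in) : inner(in), bytes(0), lines(0) {
        assert(in != 0);
    }
    std::size_t read(char* buf, std::size_t max) {
        std::size_t n = inner->read(buf, max);
        bytes += n;
        lines += static_cast<std::size_t>(std::count(buf, buf + n, '\n'));
        return n;
    }
    std::size_t byteCount() const { return bytes; }
    std::size_t lineCount() const { return lines; }
};
```

In this pattern the analyzer is only finished once the consumer has read the whole stream, so the equivalent of isReadyWithStream() could not simply return true.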

Registering the analyzer

KDE4 keeps a register of the capabilities of each analyzer. This allows it to determine more quickly which analyzers to use. In addition, when it knows what data type an analyzer provides under what name, it can use this information to optimize the storage of the data or search queries. For this, each loadable analyzer must define two factories: an AnalyzerFactoryFactory and either a StreamThroughAnalyzerFactory or a StreamEndAnalyzerFactory.

AnalyzerFactoryFactory

The AnalyzerFactoryFactory is used only in loadable plugins. It is not needed for analyzers that are part of the Strigi core. To initialize the StreamThroughAnalyzerFactories and StreamEndAnalyzerFactories in plugins, we need to implement an AnalyzerFactoryFactory. We do so by implementing one or both of the functions streamThroughAnalyzerFactories() and streamEndAnalyzerFactories(). These functions return instances of all the factories available in a plugin. Here is, for example, the AnalyzerFactoryFactory for the KDE trash file analyzer (we cannot use the BMP analyzer here, because it is in the Strigi core):

class Factory : public AnalyzerFactoryFactory {
public:
    list<StreamThroughAnalyzerFactory*>
    streamThroughAnalyzerFactories() const {
        list<StreamThroughAnalyzerFactory*> af;
        af.push_back(new TrashThroughAnalyzerFactory());
        return af;
    }
};
 
// macro that initializes the Factory when the plugin is loaded
STRIGI_ANALYZER_FACTORY(Factory)
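
The following self-contained sketch illustrates how a host could use such a factory of factories. It is not Strigi code: Analyzer, AnalyzerFactory, PluginFactories and loadPluginNames are hypothetical stand-ins, and the assumption that the host owns and deletes the returned factories is made only for this example.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical stand-ins; these are NOT the Strigi classes.
struct Analyzer {
    virtual ~Analyzer() {}
};

struct AnalyzerFactory {
    virtual ~AnalyzerFactory() {}
    virtual const char* name() const = 0;
    virtual Analyzer* newInstance() const = 0;
};

struct TrashAnalyzer : Analyzer {};

struct TrashAnalyzerFactory : AnalyzerFactory {
    const char* name() const { return "TrashAnalyzer"; }
    Analyzer* newInstance() const { return new TrashAnalyzer(); }
};

// Plays the role of the AnalyzerFactoryFactory: a single entry point
// that hands out every factory the plugin provides.
struct PluginFactories {
    std::list<AnalyzerFactory*> factories() const {
        std::list<AnalyzerFactory*> af;
        af.push_back(new TrashAnalyzerFactory());
        return af;
    }
};

// Host-side sketch: collect the names of all factories in a plugin.
// Assumption for this example only: the host owns the returned
// factories and deletes them when it is done.
std::list<std::string> loadPluginNames(const PluginFactories& plugin) {
    std::list<std::string> names;
    std::list<AnalyzerFactory*> af = plugin.factories();
    for (std::list<AnalyzerFactory*>::iterator i = af.begin();
         i != af.end(); ++i) {
        assert(*i != 0);
        names.push_back((*i)->name());
        delete *i;
    }
    return names;
}
```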

StreamEndAnalyzerFactory

StreamEndAnalyzerFactories and StreamThroughAnalyzerFactories are very similar. They provide information about analyzers and they create instances of the analyzers. Each analyzer must have a factory. StreamThroughAnalyzers have StreamThroughAnalyzerFactories and StreamEndAnalyzers have StreamEndAnalyzerFactories. Here, we look at the factory for the BmpEndAnalyzer.

The class BmpEndAnalyzerFactory looks like this:

class BmpEndAnalyzerFactory : public Strigi::StreamEndAnalyzerFactory {
friend class BmpEndAnalyzer;
private:
    static const cnstr typeFieldName;
    static const cnstr compressionFieldName;
    const Strigi::RegisteredField* typeField;
    const Strigi::RegisteredField* compressionField;
    const char* name() const {
        return "BmpEndAnalyzer";
    }
    Strigi::StreamEndAnalyzer* newInstance() const {
        return new BmpEndAnalyzer(this);
    }
    void registerFields(Strigi::FieldRegister&);
};

All members are private, which is fine, because the important functions are virtual and thus accessible through the base class anyway. The functions name() and newInstance() are self-explanatory. The other important function is registerFields(Strigi::FieldRegister&). To speed up the extraction of fields, we don't use strings to identify fields; instead we use pointers to registered fields. These RegisteredField instances are stored in a global register. When you extract a piece of metadata, you pass the pointer to the registered field to identify the metadata.

But first we need to register the fields:

#include <strigi/fieldtypes.h>
#include <strigi/analysisresult.h>
 
void BmpEndAnalyzerFactory::registerFields(Strigi::FieldRegister& reg) {
    typeField = reg.registerField(typeFieldName,
                                  Strigi::FieldRegister::stringType,
                                  1, 0);
    compressionField = reg.registerField(compressionFieldName,
                                         Strigi::FieldRegister::stringType,
                                         1, 0);
}

We pass the key of the field to the register, along with its type, the maximum number of times the field occurs per resource, and the parent of the field. The parent of a song title, for example, could be a more general title field. The data type of this field should be the same as, or a subset of, the field type of the parent.

The pointers to the registered fields are used during the analysis to identify the type of data we have analyzed:

   idx.addValue(factory->typeField, "OS/2 Color Icon");
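
The idea of identifying fields by pointer rather than by string can be sketched without Strigi. In the example below, Field, Registry and Result are hypothetical stand-ins for RegisteredField, FieldRegister and AnalysisResult; the real classes have different members and signatures. The string key is looked up once at registration time, and every later comparison is a pointer comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a registered field; NOT Strigi::RegisteredField.
struct Field {
    std::string key;
    int maxOccurs;
    const Field* parent;
};

// Hypothetical stand-in for the global register.
class Registry {
    std::map<std::string, Field> fields; // map nodes keep a stable address
public:
    // Register a key once; repeated calls return the same pointer.
    const Field* registerField(const std::string& key, int maxOccurs,
                               const Field* parent) {
        std::map<std::string, Field>::iterator i = fields.find(key);
        if (i == fields.end()) {
            Field f = { key, maxOccurs, parent };
            i = fields.insert(std::make_pair(key, f)).first;
        }
        return &i->second;
    }
};

// Hypothetical stand-in for the result object: stores (field, value) pairs.
class Result {
    std::vector<std::pair<const Field*, std::string> > values;
public:
    void addValue(const Field* field, const std::string& value) {
        assert(field != 0);
        values.push_back(std::make_pair(field, value));
    }
    // Lookup compares pointers, never strings.
    std::string firstValue(const Field* field) const {
        for (std::size_t i = 0; i < values.size(); ++i) {
            if (values[i].first == field) return values[i].second;
        }
        return std::string();
    }
};
```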

Building the analyzer

Building your analyzer is easy. There are three things you must take into account:

  • Link the analyzer as a module,
  • Let the name of the analyzer start with 'strigita_' for a StreamThroughAnalyzer and 'strigiea_' for a StreamEndAnalyzer,
  • Install the plugin in the lib/strigi directory.

Here is the CMakeLists.txt code to do this:

add_library(trash MODULE trashthroughanalyzer.cpp trashimpl.cpp)
target_link_libraries(trash ${STRIGI_STREAMANALYZER_LIBRARY})
set_target_properties(trash PROPERTIES
    PREFIX strigita_)
install(TARGETS trash LIBRARY DESTINATION ${LIB_INSTALL_DIR}/strigi)

Testing your code

Strigi comes with a simple command-line tool to check whether your plugins work. This tool is called xmlindexer. It extracts data from files and outputs it as simple XML. To use it, call it like this:

xmlindexer [FILE]

or

xmlindexer [DIR]

This is very fast and I recommend using it with valgrind. This hardly slows down your workflow but helps to keep memory management in good shape:

valgrind xmlindexer [DIR]

This page was last modified on 30 June 2011, at 11:20. Content is available under Creative Commons License SA 3.0 as well as the GNU Free Documentation License 1.2.