Note: this tutorial is not finished yet

Writing KDE4 file analyzers

File analyzers extract data from files to display in file dialogs and file managers. The data gathered this way is also used to search for files. KDE4 allows the use of multiple analyzers per file type. Analyzers can extract text, which is used for indexing, but they can also retrieve other data such as song title, album title, recipient, MD5 sum, the MIME type of a file, and much more.

This tutorial describes how you can write new analyzers.

Primer

What are file analyzers?

File analyzers in KDE4

KDE4 uses stream-based file analyzers for retrieving text and metadata from files. This has a number of advantages over file-based methods. Stream-based access

is faster for 90% of the file types,
allows easy analysis of embedded files such as email attachments or entries from zip files, rpms and many other container file formats.

Writing stream-based analyzers requires a different approach than the usual file-based methods, and this tutorial explains how to go about it.
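To illustrate the difference in approach, here is a minimal sketch that is not taken from Strigi. The hypothetical function looksLikeBmp receives a stream instead of a file name, so the same code can handle a file on disk and data embedded in a container. The function names and the use of the standard C++ streams are assumptions made for illustration; Strigi uses its own stream classes.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper, not part of Strigi: checks whether the stream
// starts with the two-byte BMP signature "BM". It only reads forward
// and never needs a file name or random access.
bool looksLikeBmp(std::istream& in) {
    char magic[2];
    in.read(magic, 2);
    return in.gcount() == 2 && magic[0] == 'B' && magic[1] == 'M';
}

// Hypothetical helper: analyzes data embedded in a larger container by
// wrapping the embedded bytes in a new stream and reusing the analyzer.
bool embeddedLooksLikeBmp(const std::string& container,
                          std::size_t offset, std::size_t length) {
    assert(offset <= container.size());
    std::istringstream sub(container.substr(offset, length));
    return looksLikeBmp(sub);
}
```

Because the analyzer sees only a stream, the caller decides where the bytes come from. A file-based analyzer would first have to unpack the embedded data to a temporary file.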

Finding documentation

Look for existing code

If you want to see some code examples, take a look at the already implemented file analyzers at /kdesupport/strigi/src/streamindexer/

Registering the analyzer

KDE4 keeps a register of the capabilities of each analyzer. This speeds up determining which analyzers to use. In addition, when KDE4 knows what data type an analyzer provides under what name, it can use this information to optimize the storage of the data or search queries. For this, each loadable analyzer must define two factories: an AnalyzerFactoryFactory and either a StreamThroughAnalyzerFactory or a StreamEndAnalyzerFactory.

AnalyzerFactoryFactory

The AnalyzerFactoryFactory is used only in loadable plugins; it is not needed for analyzers that are part of the Strigi core. Its purpose is to initialize the StreamThroughAnalyzerFactories and StreamEndAnalyzerFactories. It does so by implementing one or both of the functions getStreamThroughAnalyzerFactories() and getStreamEndAnalyzerFactories(). These functions return instances of all the factories available in a plugin. Here, for example, is the AnalyzerFactoryFactory for the KDE trash file analyzer:

class Factory : public AnalyzerFactoryFactory {
public:
    list<StreamThroughAnalyzerFactory*>
    getStreamThroughAnalyzerFactories() const {
        list<StreamThroughAnalyzerFactory*> af;
        af.push_back(new TrashThroughAnalyzerFactory());
        return af;
    }
};

STRIGI_ANALYZER_FACTORY(Factory)
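
The following self-contained sketch shows how a host might use such a factory to discover the analyzers in a plugin. All classes in it are simplified stand-ins written for this example, not the Strigi declarations. The function listAnalyzerNames, the name string "TrashThroughAnalyzer", and the ownership rule (the caller deletes the factories) are assumptions.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Simplified stand-ins for the Strigi interfaces; the real declarations
// have more members and may differ.
class StreamThroughAnalyzerFactory {
public:
    virtual ~StreamThroughAnalyzerFactory() {}
    virtual const char* getName() const = 0;
};

class AnalyzerFactoryFactory {
public:
    virtual ~AnalyzerFactoryFactory() {}
    virtual std::list<StreamThroughAnalyzerFactory*>
    getStreamThroughAnalyzerFactories() const {
        return std::list<StreamThroughAnalyzerFactory*>();  // none by default
    }
};

// A plugin with a single through analyzer factory, mirroring the trash example.
class TrashThroughAnalyzerFactory : public StreamThroughAnalyzerFactory {
public:
    const char* getName() const { return "TrashThroughAnalyzer"; }
};

class Factory : public AnalyzerFactoryFactory {
public:
    std::list<StreamThroughAnalyzerFactory*>
    getStreamThroughAnalyzerFactories() const {
        std::list<StreamThroughAnalyzerFactory*> af;
        af.push_back(new TrashThroughAnalyzerFactory());
        return af;
    }
};

// Hypothetical host-side code: collect the names of all factories that a
// plugin provides. In this sketch the caller owns the returned factories.
std::vector<std::string> listAnalyzerNames(const AnalyzerFactoryFactory& plugin) {
    std::vector<std::string> names;
    std::list<StreamThroughAnalyzerFactory*> factories =
        plugin.getStreamThroughAnalyzerFactories();
    for (StreamThroughAnalyzerFactory* f : factories) {
        assert(f != nullptr);
        names.push_back(f->getName());
        delete f;
    }
    return names;
}
```

A plugin that provides only end analyzers would leave getStreamThroughAnalyzerFactories() at its default, which in this sketch returns an empty list.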

StreamEndAnalyzerFactory

StreamEndAnalyzerFactories and StreamThroughAnalyzerFactories are very similar. They provide information about analyzers and they create instances of the analyzers. Each analyzer must have a factory. StreamThroughAnalyzers have StreamThroughAnalyzerFactories and StreamEndAnalyzers have StreamEndAnalyzerFactories. Here, we look at the factory for the BmpEndAnalyzer.

The class BmpEndAnalyzerFactory looks like this:

class BmpEndAnalyzerFactory : public jstreams::StreamEndAnalyzerFactory {
friend class BmpEndAnalyzer;
private:
    static const cnstr typeFieldName;
    static const cnstr compressionFieldName;
    const jstreams::RegisteredField* typeField;
    const jstreams::RegisteredField* compressionField;
    const char* getName() const {
        return "BmpEndAnalyzer";
    }
    jstreams::StreamEndAnalyzer* newInstance() const {
        return new BmpEndAnalyzer(this);
    }
    void registerFields(jstreams::FieldRegister&);
};

All members are private, which is OK, because the important functions are virtual and thus accessible anyway. The functions getName() and newInstance() are self-explanatory. The other important function is registerFields(jstreams::FieldRegister&). To speed up the extraction of fields, we don't use strings to identify fields; instead, we use pointers to registered fields. These RegisteredField instances are stored in a global register. When you extract a piece of metadata, you pass the pointer to the registered field to identify the metadata.
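
The idea of identifying fields by pointer rather than by string can be sketched as follows. This is a simplified model, not the Strigi implementation: the RegisteredField and FieldRegister classes below are stand-ins with a reduced interface, and isTypeField is a hypothetical helper.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-in for a registered field; not the Strigi class.
struct RegisteredField {
    std::string key;
};

// Simplified stand-in for the global register. std::map never moves its
// elements, so pointers to registered fields stay valid as more are added.
class FieldRegister {
    std::map<std::string, RegisteredField> fields;
public:
    const RegisteredField* registerField(const std::string& key) {
        RegisteredField& f = fields[key];
        f.key = key;
        return &f;  // the same pointer for repeated registrations of one key
    }
};

// Hypothetical analyzer step: identifying a field is a single pointer
// comparison instead of a string comparison.
bool isTypeField(const RegisteredField* field, const RegisteredField* typeField) {
    assert(field != nullptr && typeField != nullptr);
    return field == typeField;
}
```

The string key is looked up only once, at registration time. Every later use of the field during analysis compares or passes a pointer.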

But first we need to register the fields:

void
BmpEndAnalyzerFactory::registerFields(FieldRegister& reg) {
    typeField = reg.registerField(typeFieldName, FieldRegister::stringType,
        1, 0);
    compressionField = reg.registerField(compressionFieldName,
        FieldRegister::stringType, 1, 0);
}

We pass the key of the field to the register, along with its type, the maximum number of times the field occurs per resource, and the parent of the field. The parent of a song title, for example, could be a more general title field. The data type of this field should be the same as, or a subset of, the field type of the parent.
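
To make the role of the parent concrete, here is a minimal sketch of a register that records a parent for each field. It is a model written for this tutorial, not the Strigi API: the classes are stand-ins, types are plain strings, the field keys "title" and "song.title" are made up, and the sketch only accepts a child whose type equals the parent's type (it does not model subsets).

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-ins; names and members are assumptions, not Strigi's API.
struct RegisteredField {
    std::string key;
    std::string type;
    int maxOccurs;
    const RegisteredField* parent;  // null when the field has no parent
};

class FieldRegister {
    std::map<std::string, RegisteredField> fields;
public:
    const RegisteredField* registerField(const std::string& key,
            const std::string& type, int maxOccurs,
            const RegisteredField* parent) {
        // In this sketch a child must have the same type as its parent.
        assert(parent == nullptr || parent->type == type);
        RegisteredField& f = fields[key];
        f.key = key;
        f.type = type;
        f.maxOccurs = maxOccurs;
        f.parent = parent;
        return &f;
    }
};

// True if 'field' is 'ancestor' or descends from it, so that a query on
// the general field can also match the more specific one.
bool isA(const RegisteredField* field, const RegisteredField* ancestor) {
    for (; field != nullptr; field = field->parent) {
        if (field == ancestor) return true;
    }
    return false;
}
```

With such a chain, a search on the general title field could also find values stored under the song title field, which is one plausible use of the parent information.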

Testing your code

Strigi comes with a simple command line tool to check if your plugins work. This tool is called xmlindexer. It extracts data from files and outputs it as simple XML. To use it, call it like this:

xmlindexer [FILE]

or

xmlindexer [DIR]

This is very fast, and I recommend using it with valgrind. This hardly slows down your workflow but helps to keep memory management in good shape:

valgrind xmlindexer [DIR]