User:Cmollekopf/Drafts/AkonadiNepmukFeeder

Overview

The feeder consist of an agent, which receives notification about new/changed items using a ChangeRecorder, and a feederqueue, which processes the actual indexing.

The changes received from the ChangeRecorder result in items being added to the feederqueue, which consists itself of three queues:

collectionqueue: Queue for collections to index
high prio queue: item indexing queue with high priority
low prio queue: item indexing queue with low priority (Indexes slower, if the high prio queue has items those take precedence)

Initial indexing

On startup the indexer goes through all collections if they have already been indexed (which is established by a query to nepomuk), and if not it adds not indexed collections to the collectionqueue, which will result in the items being added to the low priority queue (the initial indexing is always considered a low priority task).

Challenges

We're triggering the work but are not actually doing it. All the hard work is done in virtuosos, so if we throw to much work at it, it either goes crazy or is just busy for a pretty long time, therefore we work with timeouts between indexing items, and try this way to not overload virtuoso.
We don't get any priority for notifications. It's generally speaking the same if you edit a note (which should be indexed instantly because applications may rely on that data) or if you add an email account with 300'000 mails (for which we may take our time).
The code is without unittests. Due to the use of async akonadi and nepomuk calls throughout the code, all those async calls would need to be wrapped to allow slipping mock objects which could be used to test the indexing.

Honoring power consumption and CPU usage

Currently we have the idledetection provided by KIdleTime already in place, we don't honor the powersource though, as well as we don't avoid the indexing while watching a movie or so.

Avoiding unnecessary indexing

We're already filtering some changes which don't affect the indexed data and I'm not sure if there is much more we can do in that respect. It might be worthwile though to optimize certain spcial cases such as mass flag changes (\seen on email). Since there is only very little data changing for many items, this could affect the performance quite a bit if there is a way store this information more efficiently.

Querying

Queries are used to check whether a collection/item needs to be (re-)indexed after getting i.e. a notification for an item (so it ends up in the indexing queue). For instance, all collections are re-queried on every startup if they have already been indexed (for the initial indexing). In an ideal world, that re-querying would of course not be necessary as the change-recorder would allow us to be notified of each and every change that happens to the akonadi store, and we can index accordingly, however, we used to loose updates in the feeder itself, some items fail to be indexed, or some other unforseen problem can lead to unindexed items in indexed collections meaning they will never ever get indexed (unless modfied again). The only reasonable way to fix this IMO is to requery ALL items from time to time and index what's missing.

As a nice side-effect that would also allow us to be a bit sloppier with the received notifications as it can be quite a PITA to make sure none of those notifications are lost and all get processed eventually. Especially when the feeder process can be killed, indexing can be turned off (and meanwhile notificaitons pile up indefinitely), the computer can just crash...

So re-querying really seems a more sustainable approach. The problem is however the amount of time it takes to query each item one by one.

Only checking ~9000 emails whether they have already been indexed takes here around 370s (that's 40ms per mail). If you think of email collections of up to 1'000'000 mails that's more than 41100s (~11h). Which means IMO that we either need to get performance improvements in the order of 10 or we need to prioritize some collections (such as non-mail and for mail inbox only etc.). I do think that it should be in the realm of the achievable to query for ~1000000 items within a couple of minutes, so we can execute this as regular maintenance task. We might have to change our strategy a bit though.

E.g. we could query for lists of items (once in nepomuk and once in akonadi), comparing them would already give us a list of missing items. With that we could also save use afterwards one query per item to check if it's already indexed (we know it isn't).

Should querying nepomuk indeed show to be that slow due to its internal database structure, we could still work with an akonadi attribute (I'd like to avoid that if possible though).

Note that repeatedly querying for the same item results in queries being completed in <2ms due to the virtuoso internal caching.

Indexing

Indexing is fairly slow as kdepim-runtime/agents/nepomukfeeder/test/performancetest.cpp shows. The simplest mail possible takes already ~0.5s to index on my machine, I don't know however where most of that time is spent.

Another problem are i.e. mails with several Cc email addresses. I can i.e. index up to ~12 Cc mail without problem (and it takes maybe ~0.7s), but as soon as I go above this the indexing takes like ~10s and then virtuoso goes crazy.

With an even larger number of CC mailaddresses viruoso just bails out with a bufferoverflow because one of the generated queries is just too large.

The middle case is of course the most dangerous as it results in viruoso not doing any work anymore while burning the cpu until it's killed.

The whole feeder is designed for batch processing with the original idea that this would improve the performance greatly, currently we're processing batches of 1 though to not trigger the above problem. If we have more efficient querying which works on batches of items, and the caching which could potentially greatly reduce the size of the stored SimpleResourceGraphs, it may start to make sense to work on batches of items.

Caching for Batch indexing

As Vishesh suggested the caching could give a serious performance boost when indexing lots of similar items (such as a mailinglist).

Dealing with large batches of new/changed items

If you're i.e. adding a new email account, that results in all 300'000 emails being added as new task to the indexing queue. Currently those end up in the high priority queue, but they should really be treated just as the initial indexing. One way would be to move them to the lowPrio queue, but I'd prefer just skipping them and letting the maintenance task pick them up. The challenge is to distinguish between items which are part of new collections and regular changes which may happen at the same time.

Other indexing strategies

Another strategy to speed the initial indexing up could be to first index i.e. subject only, and in a second round go through all remaining stuff such as text content and attachments. With the current indexing and quering performance I doubt it makes sense to work on this though.

Other issues

Maildir with fileindexer could produce duplicates:

If the fileindexer indexes the same items that could lead to duplicates when searching for fulltext. Not sure if they're indexed and the problem is resolved for searching due to the AkonadiObject type used in the feeder.

Maildir indexing probably doesn't work because we use cacheOnly (to not download disconnected IMAP stuff), but maildir is never in the cache.

I have never tested it, but this would leadt to maildir resources (and probably others as well), to never be indexed due to the failing ItemFetchJob.

We can't remove resources first as otherwise relations (i.e. to a topic) are lost (which are subresources of the resource):

According to vishesh it would be performance-wise to not use the OverwriteProperties flag, and instead fully remove the resource and index it freshly. The problem is that removeDataByApplication, when using the RemoveSubresources flag, removes also the created PIMO representation (correctly so), and thus also all relations (e.g. a relation to a topic), which makes the indexed resource unusable from applications (or at least vastly less useful).

Notes

I conducted my tests on a T420s with an Intel i7-2640M processor and an SSD disk, so performance may be worse on other systems.

Measurements

In the order of priority:

Improve query performance:

If we're faster with querying we can tackle several problems at the same time. The normal indexing is simply faster, and it's prerequisite so the following measurements work well.

Implement initial indexing as recurring maintenance task:

By implementing the initial indexing as recurring maintenance task, we no longer have to store all notifications, and can simply skip large batches of added items (as they will be picked up eventually). This means we can disable the ChangeRecorder completely while the indexer is turned off (no huge changefiles, as well as no outdated notifications anymore). So overall it would simplify the whole process greatly and fix a couple of nasty problems. For this one it's especially important to have decent query performance as we will repeatedly query all items existing, so something like a per collection mass query would make sense I think.

The only problem with the recurring maintenance task are items which fail to be indexed and will repeatedly tried if not blocked somehow (i.e. by an attribute).

Either fix nepomuk/virtuoso or block indexing of items which we know can fail (lots of To:/Cc: emails for instance)
Honor powersource
Caching for Batch indexing (at least for mail)