Fixing and Polishing Search
At the Osnabrück PIM Meeting 2012 we started an effort to make searching and indexing of PIM data useful. The following tree classifies the work that has been done.
Faults in indexing
Performance faults while indexing
Excessive work per item
- FIXED Excessive queries per item kde#289932#c58 , kde#289932#c87 (754275eda610dce1160286a76339353097d8764c in kde-runtime/4.8)
- Attachments are fetched but not effectively indexed. Attachments are indexed by a helper process (nepomukindexer), which needs the final URI of the attachment object. However, what we pass in are the temporary _:xxxx URIs that still need to be resolved by DMS. StoreResourceJob contains the mapping AFAICT, so it's probably just a matter of deferring the indexData() calls until we have the result of that job.
- Setting the same icons on mails, their attachments and their tags while indexing: is this necessary? This is commented in non-mail feeder plugins. The icons are added to have pretty search results. The (expensive) resource identification only happens when creating new SimpleResource objects, not when setting existing URIs as properties. So simply caching the icons should fix this.
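The deferral suggested for attachments above can be sketched as follows. This is only an illustration in Python, not the feeder's actual code: `DeferredAttachmentIndexer`, `queue`, `on_store_finished` and the `index_data` callback are hypothetical names, and the mapping is assumed to arrive as a plain dictionary from temporary `_:xxxx` URIs to final URIs once the store job has finished.

```python
class DeferredAttachmentIndexer:
    """Holds back attachment indexing until temporary URIs are resolved.

    Hypothetical sketch: index_data(final_uri, payload) stands in for the
    call that hands an attachment to the helper process.
    """

    def __init__(self, index_data):
        self._index_data = index_data
        self._pending = []  # list of (temporary_uri, payload)

    def queue(self, temporary_uri, payload):
        # Nothing is indexed yet; we only remember what to index later.
        self._pending.append((temporary_uri, payload))

    def on_store_finished(self, mappings):
        """Index every queued attachment whose temporary URI was resolved.

        mappings: dict of temporary URI -> final URI (assumed shape).
        Returns the temporary URIs that had no mapping, so the caller
        can log or retry them.
        """
        unresolved = []
        for temporary_uri, payload in self._pending:
            final_uri = mappings.get(temporary_uri)
            if final_uri is None:
                unresolved.append(temporary_uri)
                continue
            self._index_data(final_uri, payload)
        self._pending = []
        return unresolved
```

The point of the design is that the helper process never sees a `_:xxxx` URI: it is either given a final URI or not called at all.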
Repeated indexing per item
Failures to index items
- FIXED Cardinality fault on messageHeader
- FIXED Cardinality fault on PIMO:Persons' properties d732592b in kde-runtime/master
Repeated indexing per collection
- FIXED Attempted indexing of collections we cannot index ec4f19eb781514ce0dfc09fe4e9ea4591ecc31e9 in kdepim-runtime/4.8
- FIXED Mark each collection on completion with indexing level 2729771b765d0bd6e0e03d0a5b055e36bc48944c in kdepim-runtime/master (does this prevent discovery of items changed while feeder was not running?)
Indexing interferes with other work
- FIXED Hide indexing until user is idle kde#289932#c58
Low nominal performance
- E.g. 5700 kde-core-devel mails (42MB mbox) in 20 minutes (4.8 items/sec) on a Core i7-2620M (4x2.7GHz, HT), idle detection disabled. It is not clear what the bottleneck is. Virtuoso was using 80-90% of one core during this.
- Passing all mail through Akonadi->feeder->dbus->nepomukstorage->virtuoso negates the performance advantage of the fast Akonadi protocol. Seeing the huge improvement after Sebastian's changes to the resource identification in DMS, I'd guess that this is where most of the time is spent. But that's just gut feeling. If that turns out to be true though, we can probably apply some more clever caching for e.g. email addresses (in a typical folder I'd assume some of them repeat quite often) to avoid running identification on them over and over again. List-Id is another good candidate for that.
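The caching idea for repeated email addresses and List-Id values could look roughly like the sketch below. It is written in Python for brevity and does not use any Nepomuk API: `IdentificationCache` and the `identify` callback are hypothetical stand-ins for the expensive resource identification. Lower-casing the key is an assumption that treats addresses as case-insensitive.

```python
class IdentificationCache:
    """Memoizes an expensive identify(key) -> resource URI lookup.

    Hypothetical sketch: identify stands in for the resource
    identification step, which is assumed to be the costly part.
    """

    def __init__(self, identify):
        self._identify = identify
        self._cache = {}
        self.misses = 0  # number of times identify() actually ran

    def resolve(self, key):
        # Assumption: addresses and List-Ids compare case-insensitively.
        normalized = key.strip().lower()
        if normalized not in self._cache:
            self.misses += 1
            self._cache[normalized] = self._identify(normalized)
        return self._cache[normalized]
```

On a mailing-list folder, where the same handful of senders and a single List-Id recur across thousands of mails, `misses` stays close to the number of distinct addresses rather than the number of mails. Whether that translates into a real speedup depends on the guess above about where the time goes.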
Ability to utilise indexing work (working search)
Search features that fully use indexed data
- Quicksearch now does fulltext search in 4.9
- Indexed: Date, Subject, From, Sender, To, Cc, Bcc, List-Id, Organization, some X-headers, Status flags, Tags, Important, Todo, Watched, Plain text body. Searchable: Age (days), Subject, From, To, Cc, Reply-To, List-Id, Organization, some X-headers, Status flags, Tags, all headers (probably not useful), message body. It would be nice to capture List-Id: as mailing list resources in the NMO ontology so we can search explicitly for mails to lists.
- No way to search by the actual PIMO Persons/Contacts created by indexing; the user must type part of a name.
- No way to search attachments or whether something has an attachment
- WIP Till: Composer address auto-completion based on all available Nepomuk data.
Faults in search
- RE-BROKEN Truncated query strings cause broken search folders (Limit needs to be more than 1024 chars)
- Dialog allows modifying existing search folder by name but fails (modifies remote id)
- It is possible to create a search inside a search folder, but it doesn't work
Viewing search results changes search results
- When searching on unread message status, messages disappear from the search because the message preview marks them as read
- Just viewing search results causes some messages to disappear from search collection. itemChanged currently is handled in the feeder as add/remove. For emails this case can be optimized for the common case of flag/tag changes, as they rarely change content.
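The flag/tag optimisation mentioned above might be sketched like this. It is a Python illustration under stated assumptions, not the feeder's code: `MailSnapshot` and `plan_update` are hypothetical, and the feeder is assumed to be able to compare a content hash, the flags and the tags of the item before and after the change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MailSnapshot:
    """Hypothetical view of an item: content hash plus flags and tags."""
    payload_hash: str
    flags: frozenset
    tags: frozenset


def plan_update(old, new):
    """Decide how to handle a changed item without a full remove/add."""
    if old.payload_hash != new.payload_hash:
        # Content really changed: fall back to the current behaviour.
        return {"action": "reindex"}
    if old.flags == new.flags and old.tags == new.tags:
        return {"action": "skip"}
    # Common case: only flags or tags changed, so touch only those.
    return {
        "action": "update_properties",
        "flags_added": sorted(new.flags - old.flags),
        "flags_removed": sorted(old.flags - new.flags),
        "tags_added": sorted(new.tags - old.tags),
        "tags_removed": sorted(old.tags - new.tags),
    }
```

With this split, marking a mail as read only updates its status property, so the resource is never removed and re-added. This does not address the separate unread-status case, where a mail leaves the results because it genuinely stops matching the query.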
Minimising indexing work
Assuming there is no/low demand for search, do less of the expensive indexing.
- Change default set of indexed folders
- Make it easy to change per folder indexing attribute
- Show indexing status, allow attr change directly in folder selector in search dialog.
- Is indexing everything except full text a useful compromise?