Seamlessly collecting all kinds of meta-data about your files (and other things) in one central place makes Nepomuk a really powerful technology for organizing and accessing your documents. The prospect of being able to rely on your computer to automatically remember helpful meta-info (like download locations) sounds especially exciting.
As these possibilities are exploited more fully in the future, extremely complex information systems might arise. That is not a problem as long as it all helps applications assist the user in managing his documents and other data more easily and efficiently. At some point, however, having tons of interesting but undifferentiated information about a specific item (e.g. a document) just “pile up” over time in “one big pot” might make it increasingly difficult for an application (say, desktop search) to really make full use of it.
Take, for a hypothetical example, a rich text document created by the user using his favorite Nepomuk-aware text processor.
- Case A: The user makes the effort to manually tag the document as “letter”. From then on, Nepomuk should obviously treat the document as a letter.
- Case B: The word processor reports that the user started working on the document by loading the “personal letter” template. But did he really go on to create a letter in the end? Nepomuk can't know for sure. On the other hand, seeing as this might be the only meta-info that exists about the document, should it really be discarded altogether?
- Case C: A Nepomuk plugin scans all rich text documents in the database, and inside this one it finds salutation and complimentary close phrases in the places where you'd expect them in a letter, so it concludes it's probable that this is a letter. Still, does this suffice for automatically tagging the document as a letter as if the user himself had tagged it?
So I propose that, in addition to having Nepomuk collect all kinds of meaningful properties about its items from different sources in accordance with its ontologies (like the “is a letter” property for a document), there should also be some unified way of “qualifying” these property-to-thing connections regarding their reliability, source, scope of validity and so on. Instead of merely saving unqualified connections like “x is y”, you might save things like:
- “x is y (by user input)”
- “x is y (by heuristic algorithm, medium credibility)”
- “x is y (by heuristic algorithm, high credibility) (source: post-processing plugin)”
- “x is y (by circumstantial evidence, low credibility) (source: original application)”
and so on, using some kind of “qualifier”-ontology.
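To make the idea more concrete, here is a minimal sketch of what such a qualified statement could look like as a data structure. It is only an illustration: the class names, the four-step credibility scale and the field names are assumptions made up for this example, not part of Nepomuk or of any existing ontology.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Credibility(IntEnum):
    """Hypothetical credibility scale; a higher value means more trustworthy."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    USER_CONFIRMED = 4


@dataclass(frozen=True)
class QualifiedStatement:
    """An "x is y" connection plus qualifiers describing how it came about."""
    subject: str                   # the thing described, e.g. a document
    prop: str                      # the property, e.g. "type"
    value: str                     # the property value, e.g. "letter"
    method: str                    # e.g. "user input", "heuristic algorithm"
    credibility: Credibility
    source: Optional[str] = None   # e.g. "post-processing plugin"


def most_credible(statements, subject, prop):
    """Return the most credible statement about subject/prop, or None."""
    matching = [s for s in statements if s.subject == subject and s.prop == prop]
    return max(matching, key=lambda s: s.credibility, default=None)


# The three cases from the letter example, all using the same property.
statements = [
    QualifiedStatement("doc1", "type", "letter", "circumstantial evidence",
                       Credibility.LOW, "original application"),
    QualifiedStatement("doc1", "type", "letter", "heuristic algorithm",
                       Credibility.HIGH, "post-processing plugin"),
    QualifiedStatement("doc1", "type", "letter", "user input",
                       Credibility.USER_CONFIRMED),
]

best = most_credible(statements, "doc1", "type")
print(best.method, best.credibility.name)
```

Keeping the qualifiers separate from the property itself means that weak statements never have to be discarded; a client simply decides how much weight to give them.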
Of course, all this information could instead be stored using separate tags (or whatever you call them) for each case. In the above example you could have an “is letter” property for user annotation, a separate “is document-created-from-letter-template” property, and yet another “is document-deemed-a-letter-by-plugin_xyz” property. However, the information would then be far less useful from the point of view of a client application, which would have to know about the tag set by the word processor, and also about the existence of the post-processing plugin, in order to make use of it. If, on the other hand, all the Nepomuk entries describing or hinting at the being-a-letter status of the document were specified in terms of a single “is letter” property with generic qualifiers (see above), it would be easy for client applications to transparently make full use of this info in a meaningful way. Imagine, for example, a semantic desktop search application that, when asked for all letters written in the last 2 months, shows user-confirmed letters at the top, and shows documents with a low-credibility “is letter” property only when the user clicks a “show further possible results” link. It would show the same behaviour for all other qualifiable tags/properties, even if it knows nothing about their meaning.
The more I think about what kind of complex semantic desktop features might be possible in the future, the more I feel that this would open up a lot of possibilities for bringing advanced semantic capabilities to the desktop that are as helpful for the user as possible, while getting in the way of his/her actual work as little as possible.
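The desktop search behaviour described above could be sketched roughly as follows. This is a hedged illustration, not an existing Nepomuk API: the `Statement` tuple, the numeric credibility scale (1 = low, 4 = user-confirmed) and the `threshold` parameter are all assumptions. Note that the function never needs to know what “letter” means; it only looks at the generic qualifier.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    prop: str
    value: str
    credibility: int  # hypothetical scale: 1 = low ... 4 = user-confirmed


def search(statements, prop, value, threshold=2):
    """Split matches into main results and "further possible results".

    Works for any property, because only the credibility qualifier is used.
    """
    best = {}
    for s in statements:
        if s.prop == prop and s.value == value:
            # Keep only the strongest statement per subject.
            best[s.subject] = max(best.get(s.subject, 0), s.credibility)
    # Most credible first; ties are broken by name for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    main = [subject for subject, cred in ranked if cred >= threshold]
    further = [subject for subject, cred in ranked if cred < threshold]
    return main, further


statements = [
    Statement("a.odt", "type", "letter", 4),   # tagged by the user
    Statement("b.odt", "type", "letter", 1),   # created from a letter template
    Statement("c.odt", "type", "letter", 3),   # classified by a plugin
    Statement("d.odt", "type", "invoice", 4),  # not a letter at all
]

main, further = search(statements, "type", "letter")
print(main)     # shown immediately
print(further)  # shown only after "show further possible results"
```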
--Sam 10:24, 25 May 2009 (UTC)
Some Possible Use-Cases
Data Collection Perspective
The following sources of meta-data collection would most likely benefit from this approach:
- Algorithms heuristically classifying documents based on their contents/structure (as in the example above)
- → client apps could then make use of this info, while still giving precedence to proper user-created classification data
- NLP algorithms linking real-life concepts to positions inside documents
- → NLP-developers could happily let their plugins attach all the (possibly weak) relations they might find, without worrying about incorrect or imprecise relations cluttering strong meta-data (created from user input or other sources) in search results, etc.
- Nepomuk-aware applications “guessing” meta-info about their docs, e.g.:
- office apps hinting at the logical document type by reporting the template used (see example above)
- web browsers attaching possible title/description tags to a downloaded file based on the caption of the download link / its context in the web page / etc.
- the file indexer attaching a possible category to a newly created file based on the current Nepomuk context at the time (see the "Categorize new files" idea)
Client Applications Perspective
Client applications might then, without having to know about the meanings of the qualified properties:
- sort search results based on the quality of the matching relations
- visually distinguish properties that are only “weakly” attached to a thing (e.g. show an auto-attached possible caption for a file in italics and grey font)
- This could actually be used to give the user an incentive to add useful meta-data to his documents not only while creating/saving them, but also at any later time he might come across them. Imagine the grey-italics caption mentioned above showing “confirm” / “discard” / “change” actions when hovering over it with the mouse, no matter the context in which you encounter it within KDE. (Personally I think this would be exactly the kind of non-intrusive but really helpful feature that would actually get many ordinary users excited about using semantic desktop technologies.)
- when browsing documents by category (tags, that is), suggest additional documents that are already weakly linked to this tag (by, say, an NLP plugin) for proper (“strong”) inclusion in this category by the user
(If you can think of further use cases, please add them to the list so that a developer thinking about designing/implementing this will consider all aspects and implications involved)
Sebastian Trüg mentioned that the "named graphs" feature of Nepomuk, currently used for remembering when a user tagged something, might allow for this kind of optional "quality" to be stored with a relation.
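As a rough sketch of how named graphs could carry such a "quality": each statement is stored as a quad (subject, predicate, object, graph), and the qualifiers are attached to the graph rather than to the statement itself. The code below models this with plain Python tuples. It does not use the real Nepomuk or Soprano API, and the `ex:` property names and `graph:` identifiers are invented for this example.

```python
# Each entry is (subject, predicate, object, graph).
quads = [
    # The same "is a letter" statement, asserted in two different graphs.
    ("doc:report.odt", "rdf:type", "ex:Letter", "graph:1"),
    ("doc:report.odt", "rdf:type", "ex:Letter", "graph:2"),
    # Qualifiers describe the graphs, not the individual statements.
    ("graph:1", "ex:assertedBy", "ex:User", "graph:meta"),
    ("graph:1", "ex:credibility", "confirmed", "graph:meta"),
    ("graph:2", "ex:assertedBy", "ex:LetterDetectorPlugin", "graph:meta"),
    ("graph:2", "ex:credibility", "medium", "graph:meta"),
]


def qualifiers_for(quads, subject, predicate, obj):
    """Return {graph: {qualifier: value}} for every graph asserting the triple."""
    graphs = [g for s, p, o, g in quads if (s, p, o) == (subject, predicate, obj)]
    return {
        graph: {p: o for s, p, o, _ in quads if s == graph}
        for graph in graphs
    }


result = qualifiers_for(quads, "doc:report.odt", "rdf:type", "ex:Letter")
for graph, qualifiers in result.items():
    print(graph, qualifiers)
```

One appeal of this layout is that the statement itself stays an ordinary “x is y” triple, so clients that ignore qualifiers keep working unchanged.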
- Scribo project: the Scribo project aims at implementing at least partly the features described by Case C in KDE, i.e. providing NLP tools for extracting metadata from text documents. The tools are meant to be configurable for analysing specific types of documents (such as letters, technical documents related to KDE). They rely on various analysis engines: Antelope by Proxem, CEA NLP tools, INRIA tools, OpenCalais, GATE engine etc. The specificity of the approach is that the user will be able to give feedback to the analysis engines for them to improve their heuristics. Scribo partners will take part in RMLL 2009. The roadmap of Scribo implementation in KDE is available on the Mandriva Scribo page and on .