Revision as of 14:59, 25 May 2010 by VHanda (talk | contribs) (typos)

User:VHanda/Metadata-backupsync

Metadata Backup Sync Ontology

Possible names for the Ontology

  • Nepomuk Backup Sync Ontology (NBSO)
  • Nepomuk Backup Ontology (NBO)
  • Nepomuk Identification Ontology (NIO)

Or we could just couple it with NAO. An additional ontology isn't really required.

Ontology Level Markup of identifying properties

nxx:identifyingProperty a rdf:Property ;
    rdfs:comment """The combination of the values of all identifying properties
is very likely to be unique for a resource. However, unlike nao:identifier,
each individual identifying property value is not likely to be unique.""" .

nxx:theIdentifyingProperty a rdf:Property ;
    rdfs:subPropertyOf nxx:identifyingProperty ;
    nrl:maxCardinality 1 ;
    rdfs:comment """A property or combination that will identify the Resource.
An example of this is nfo:hashValue.""" .

Other stuff which could be added -

nxx:identifyingProperty could be the base identifying property and it could have 3 subProperties -

  • nxx:theIdentifyingProperty
  • nxx:optionalIdentifyingProperties
  • nxx:mandatoryIdentifyingProperties

I think this would be quite useless, as almost every property could be labeled as optional. Plus too many options are confusing.

Different Ontologies

Just a rough draft of what properties should be derived from nxx:identifyingProperty -

RDFS

  • rdfs:label

NAO

  • nao:identifier
  • nao:rating
  • nao:hasTag ?

NIE

  • nie:url
  • nie:isPartOf
  • nie:isStoredAs ?
  • nie:mimeType ?

NFO

  • nfo:fileName
  • nfo:hashValue -- nxx:theIdentifyingProperty
  • nfo:fileSize

NMM

  • nmm:performer

NCO

  • nco:url
  • nco:contactUID
  • nco:nickname
  • nco:imID
  • nco:fullname
  • nco:birthDate
  • nco:emailAddress

Some of these details might not be provided by default. This is where the Web Metadata extractor comes into play. :)

Identification File

Along with the diff file, an identification file would be produced which would contain the rdf:type and all the nxx:identifyingProperty values of every resource in the diff file. These two files could then be exported to the machine with which the metadata is to be synced/backed up.
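A rough sketch of how the identification file contents could be collected. The tuple-based store, the resource URIs and the hard-coded property set are assumptions made up for illustration; the real service would query Nepomuk and read the identifying properties from the ontology.

```python
from collections import defaultdict

# Assumed subset of properties derived from nxx:identifyingProperty.
IDENTIFYING_PROPERTIES = {"nao:identifier", "nfo:fileName", "nfo:hashValue", "nfo:fileSize"}


def build_identification(diff_subjects, store):
    """Collect rdf:type and identifying statements for each resource in the diff."""
    wanted = IDENTIFYING_PROPERTIES | {"rdf:type"}
    result = defaultdict(list)
    for subject, predicate, obj in store:
        if subject in diff_subjects and predicate in wanted:
            result[subject].append((predicate, obj))
    return dict(result)


# Hypothetical statements standing in for the metadata store.
store = [
    ("res:1", "rdf:type", "nfo:FileDataObject"),
    ("res:1", "nfo:fileName", "song.ogg"),
    ("res:1", "nie:lastModified", "2010-05-20"),  # not identifying, skipped
    ("res:2", "rdf:type", "nao:Tag"),
    ("res:2", "nao:identifier", "NewTag"),
    ("res:3", "nfo:fileName", "untouched.txt"),   # not in the diff, skipped
]
ident = build_identification({"res:1", "res:2"}, store)
```

Only resources that appear in the diff end up in the result, so the identification file grows with the diff file and not with the whole database.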

Possible Problems

The diff file can get quite large over time, and hence contain a large number of resources. This would result in the identification files becoming really large. The most obvious solution is to note the last sync date, and accordingly truncate the diff files.
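The truncation idea could look roughly like this. The entry layout (a dict with a timestamp and a statement) is an assumption for the sketch and not the actual diff file format.

```python
from datetime import datetime


def truncate_diff(entries, last_sync):
    """Keep only the entries recorded strictly after the last sync."""
    return [entry for entry in entries if entry["timestamp"] > last_sync]


# Hypothetical diff entries.
entries = [
    {"timestamp": datetime(2010, 5, 1), "statement": ("res:1", "nao:hasTag", "res:2")},
    {"timestamp": datetime(2010, 5, 20), "statement": ("res:3", "nao:rating", "7")},
]
recent = truncate_diff(entries, last_sync=datetime(2010, 5, 10))
```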

Syncing with multiple machines - The current approach would require every machine to transfer its sync file (diff file + identification file) to every other machine. That, coupled with the identification and merging process, would become quite computationally expensive, especially with real-time syncing (if implemented).


Redundant tuples - Currently Strigi doesn't reuse existing information. For example, when creating metadata for multimedia files, a new contact is created for every nmm:performer, even if the performer has the same name. So we can't rely on certain properties like nmm:performer. Alternatively, we could have rules which compare the nco:fullname whenever contacts are involved. But contacts are just one case, and our solution should be generic. Ideally, it would be amazing if someone could fix Strigi.
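A minimal sketch of the nco:fullname comparison rule mentioned above. The contact URIs and the normalisation (collapse whitespace, ignore case) are assumptions; this only covers the contact case and is not the generic solution asked for.

```python
def merge_duplicate_contacts(contacts):
    """Map every contact URI to one canonical URI per normalised full name."""
    canonical = {}  # normalised name -> first URI seen with that name
    mapping = {}    # every URI -> its canonical URI
    for uri, fullname in contacts:
        key = " ".join(fullname.split()).casefold()
        mapping[uri] = canonical.setdefault(key, uri)
    return mapping


# Hypothetical contacts, as created once per nmm:performer.
contacts = [
    ("contact:1", "Miles Davis"),
    ("contact:2", "miles  davis"),
    ("contact:3", "John Coltrane"),
]
mapping = merge_duplicate_contacts(contacts)
```

Here contact:2 maps onto contact:1, so statements pointing at either could be treated as referring to the same performer during identification.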


Implementation Details

The core would consist of three components -

  1. Diff Service - This would be responsible for tracking changes and generating the diff file along with an identification file. It should be implemented as a Nepomuk Service - mostly implemented.
  2. Identifier - This is the system that accesses the ontology, and is responsible for taking a sync file and resolving identification issues.
  3. Merger - Responsible for merging two files once the identification process has been completed.

Diff Service

Please think of a better name. I hate this name! The basic outline has been implemented.

Identifier

(Rough brain dump)

System A = The machine with metadata that has generated the sync file.

System B = The machine with which A's metadata is to be synced.

When identifying which resource from System A corresponds to which resource from System B, the possible cases are -

  1. The resource is found.
  2. The resource is not found.
    1. The file/contact/physical representation of the resource exists on System B, but it has no metadata associated with it, and hence doesn't exist in the database.
    2. The file/contact/whatever doesn't actually exist in system B.

Differentiating between 2.1 and 2.2 is going to be hard.

Example - A file exists on System A which also exists on System B, but the file doesn't have any metadata on either system. A new tag called "NewTag" is created on System A and applied to the file. Then both systems are synced. A new tag could be created on System B because the nao:identifier would tell us about the tag, but how are we supposed to find the correct file on System B? Its location and filename could have potentially changed, and even if they haven't, the database would have no record of the file. How do you differentiate between not being able to find the file and the file not existing on the other system?
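A rough sketch of the lookup the Identifier would perform on System B. The dict-based resources, the URIs and the matching policy (a hash match is decisive, otherwise every identifying value from A must agree) are assumptions, not a settled design.

```python
def identify(remote, local_resources, the_identifying="nfo:hashValue"):
    """Return the URI of the local resource matching `remote`, or None."""
    if not remote:
        return None
    # A match on the single strongest property is taken as decisive.
    strong = remote.get(the_identifying)
    if strong is not None:
        for uri, props in local_resources.items():
            if props.get(the_identifying) == strong:
                return uri
    # Otherwise every identifying value sent by System A has to agree.
    for uri, props in local_resources.items():
        if all(props.get(key) == value for key, value in remote.items()):
            return uri
    return None


# Hypothetical resources known to System B.
local = {
    "file:b1": {"rdf:type": "nfo:FileDataObject", "nfo:hashValue": "abc",
                "nfo:fileName": "renamed.ogg"},
    "tag:b2": {"rdf:type": "nao:Tag", "nao:identifier": "NewTag"},
}
by_hash = identify({"rdf:type": "nfo:FileDataObject", "nfo:hashValue": "abc",
                    "nfo:fileName": "song.ogg"}, local)
by_props = identify({"rdf:type": "nao:Tag", "nao:identifier": "NewTag"}, local)
missing = identify({"rdf:type": "nfo:FileDataObject",
                    "nfo:fileName": "missing.txt"}, local)
```

Note that a None result covers both case 2.1 and case 2.2. The database alone can't tell them apart, which is exactly the problem described above.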

Special Case : Strigi isn't enabled! Then most of the metadata won't exist, and the identifier would be useless.

Home Directory Change - If the home directories differ, the nie:url of the two copies of a file will never be the same, unless they are saved in the root directory. We'll need to work around this.
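One possible workaround is to compare paths relative to each machine's home directory. This is only a sketch: the function name and the example paths are made up, and it assumes local file:// URLs with POSIX paths.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def home_relative(url, home):
    """Return the path relative to `home`, or None if the URL lies outside it."""
    path = PurePosixPath(unquote(urlparse(url).path))
    try:
        return str(path.relative_to(home))
    except ValueError:
        return None


on_a = home_relative("file:///home/alice/Music/song.ogg", "/home/alice")
on_b = home_relative("file:///home/bob/Music/song.ogg", "/home/bob")
outside = home_relative("file:///etc/fstab", "/home/alice")
```

Both on_a and on_b come out as the same relative path, so they can be compared even though the full nie:url values differ.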

Merger

The implementation shouldn't be too hard as we have the timestamps. The only important detail would be when certain resources can't be merged. Then either the issue is presented to the user or the best option is chosen (or both).
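A minimal sketch of a timestamp-based merge. The map layout and the rule that equal timestamps with different values count as a conflict are assumptions for illustration.

```python
def merge(local, remote):
    """Merge two {(subject, predicate): (value, timestamp)} maps; newest wins."""
    merged = dict(local)
    conflicts = []
    for key, (value, stamp) in remote.items():
        if key not in merged:
            merged[key] = (value, stamp)
            continue
        local_value, local_stamp = merged[key]
        if value == local_value:
            continue
        if stamp == local_stamp:
            # No way to pick a winner: keep the local value and report it.
            conflicts.append(key)
        elif stamp > local_stamp:
            merged[key] = (value, stamp)
    return merged, conflicts


# Hypothetical data with integer timestamps.
local = {
    ("res:1", "nao:rating"): ("5", 100),
    ("res:1", "nao:description"): ("old", 100),
}
remote = {
    ("res:1", "nao:rating"): ("8", 200),
    ("res:1", "nao:description"): ("new", 100),
    ("res:2", "nao:identifier"): ("NewTag", 150),
}
merged, conflicts = merge(local, remote)
```

The conflicts list is what would be handed to the user (or to some best-option policy) when resources can't be merged automatically.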

GUI

The best case scenario would be to have the Core completely segregated from the GUI. Maybe they can communicate using D-Bus? The rationale behind this approach is that it should be easy for other users to integrate metadata backup/sync with their existing mechanisms for backup/sync.

The GUI should provide features like Automated Backup/sync, but these things should just be superfluous features that anyone can implement. The main work should be done in the Core.

GUIs will be required for -

  • Solving merge conflicts
  • Potentially identifying files (If the identifier fails to do so - Optional)
  • Configuring automated backup / sync - Integrate with existing Nepomuk KCM module.

Strigi

Currently Strigi doesn't calculate the hash/checksum. It's a slow and time-consuming process. We need some kind of "slow indexer" which calculates all the hashes.
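The hashing step of such a "slow indexer" could be sketched as below. Reading in chunks keeps memory use flat for large files. The choice of SHA-1 and the chunk size are assumptions, and scheduling the work in the background is left out.

```python
import hashlib
import os
import tempfile


def file_hash(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks and return the hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello nepomuk")
digest = file_hash(tmp.name)
os.remove(tmp.name)
```

The resulting digest is the kind of value that would be stored as nfo:hashValue.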


This page was last edited on 25 May 2010, at 15:04. Content is available under Creative Commons License SA 4.0 unless otherwise noted.