User:VHanda/Metadata-backupsync

From KDE TechBase
Revision as of 11:50, 20 May 2010 by VHanda (talk | contribs) (GUI configuring automated backup / sync)

Metadata Backup Sync Ontology

Possible names for the Ontology

  • Nepomuk Backup Sync Ontology (NBSO)
  • Nepomuk Backup Ontology (NBO)
  • Nepomuk Identification Ontology (NIO)

Ontology Level Markup of identifying properties

nxx:identifyingProperty a rdf:Property ;
    rdfs:comment """The combination of the values of all identifying properties
is very likely to be unique for a resource. However, unlike nao:identifier, the
value of a single identifying property is not likely to be unique.""" .

nxx:theIdentifyingProperty a rdf:Property ;
    rdfs:subPropertyOf nxx:identifyingProperty ;
    nrl:maxCardinality 1 ;
    rdfs:comment """A property or combination that will identify the resource.
An example of this is nfo:hashValue.""" .
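
A rough sketch of how an implementation might collect every identifying property by following rdfs:subPropertyOf from nxx:identifyingProperty. The function name, the dictionary layout and the sample declarations are assumptions made for illustration; they are not part of any existing Nepomuk API.

```python
def identifying_properties(sub_property_of, root="nxx:identifyingProperty"):
    """Return every property that is a direct or indirect sub-property of `root`.

    `sub_property_of` maps a property to the set of its declared parents.
    """
    found = set()
    frontier = [root]
    while frontier:
        parent = frontier.pop()
        for child, parents in sub_property_of.items():
            if parent in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


# Hypothetical declarations, as they might look once the ontologies are marked up.
declarations = {
    "nxx:theIdentifyingProperty": {"nxx:identifyingProperty"},
    "nfo:hashValue": {"nxx:theIdentifyingProperty"},
    "nfo:fileName": {"nxx:identifyingProperty"},
    "nao:prefLabel": {"rdfs:label"},  # not derived from the identifying property
}

props = identifying_properties(declarations)
```

Here nfo:hashValue is picked up through nxx:theIdentifyingProperty, so the Identifier would only need to know about the one base property.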

Other stuff which could be added -

nxx:identifyingProperty could be the base identifying property and it could have 3 subProperties -

  • nxx:theIdentifyingProperty
  • nxx:optionalIdentifyingProperties
  • nxx:mandatoryIdentifyingProperties

I think this would be quite useless, as almost every property could be labeled as optional. Plus too many options are confusing.

Different Ontologies

Just a rough draft of what properties should be derived from nxx:identifyingProperty -

RDFS

  • rdfs:label

NAO

  • nao:identifier
  • nao:rating
  • nao:hasTag ?

NIE

  • nie:url
  • nie:isPartOf
  • nie:isStoredAs ?
  • nie:mimeType ?

NFO

  • nfo:fileName
  • nfo:hashValue -- nxx:theIdentifyingProperty
  • nfo:fileSize

NID3

  • nid3:albumTitle
  • nid3:fileType
  • nid3:InvolvedPerson ?
  • nid3:publisher ?
  • nid3:length
  • nid3:track
  • nid3:trackNumber ?

NCO

  • nco:url
  • nco:contactUID
  • nco:nickname
  • nco:imID
  • nco:fullname
  • nco:birthDate
  • nco:emailAddress

Some of these details might not be provided by default. This is where the Web Metadata extractor comes into play. :)

Identification File

Along with the diff file, an identification file would be produced, containing all the nxx:identifyingProperty statements and the rdf:type of every resource in the diff file. These two files could then be exported to the machine with which the metadata is to be synced or backed up.
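
A minimal sketch of how the identification file could be derived, assuming statements are plain (subject, predicate, object) tuples and that the set of identifying properties has already been read from the ontology. The function name, the property set and the sample data are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

# Assumption: these properties were marked as nxx:identifyingProperty.
IDENTIFYING_PROPERTIES = {
    "nao:identifier",
    "nfo:fileName",
    "nfo:hashValue",
    "nfo:fileSize",
}


def build_identification_file(diff_statements, store_statements):
    """Collect rdf:type and identifying-property statements for every
    resource that appears as a subject in the diff."""
    resources = {s for s, _, _ in diff_statements}
    identification = defaultdict(list)
    for s, p, o in store_statements:
        if s in resources and (p == RDF_TYPE or p in IDENTIFYING_PROPERTIES):
            identification[s].append((p, o))
    return dict(identification)


store = [
    ("res:1", "rdf:type", "nfo:FileDataObject"),
    ("res:1", "nfo:fileName", "song.mp3"),
    ("res:1", "nfo:hashValue", "abc123"),
    ("res:1", "nie:comment", "great song"),
    ("res:2", "rdf:type", "nao:Tag"),
    ("res:2", "nao:identifier", "NewTag"),
]
diff = [("res:1", "nie:comment", "great song")]

ident = build_identification_file(diff, store)
```

Only res:1 ends up in the result, because res:2 does not occur in the diff, and the nie:comment statement is left out because it is not an identifying property in this sketch.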

Possible Problems

The diff file can get quite large over time, and hence contain a large number of resources. This would result in the identification files becoming really large. The most obvious solution is to note the last sync date, and accordingly truncate the diff files.
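
Truncating by the last sync date could look roughly like this. The sketch assumes every diff entry carries a timestamp; the entry layout and the function name are made up for illustration.

```python
from datetime import datetime


def truncate_diff(diff_entries, last_sync):
    """Keep only the entries recorded strictly after the last successful sync."""
    return [entry for entry in diff_entries if entry["timestamp"] > last_sync]


entries = [
    {"timestamp": datetime(2010, 5, 1, 12, 0),
     "statement": ("res:1", "nao:hasTag", "res:2")},
    {"timestamp": datetime(2010, 5, 18, 9, 30),
     "statement": ("res:3", "nao:rating", "7")},
]
last_sync = datetime(2010, 5, 10)

recent = truncate_diff(entries, last_sync)
```

With several sync partners, a separate last sync date would have to be kept per partner, and entries could only be discarded once they are older than the oldest of those dates.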

Syncing with multiple machines - The current approach would require every machine to transfer its sync file (diff file + identification file) to every other machine. That, coupled with the identification and merging process, would become quite computationally expensive, especially with real-time syncing (if implemented).
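
To get a feeling for the growth: if each of n machines sends its sync file to every other machine, that is n * (n - 1) transfers per sync round, each followed by an identification and merge step. A tiny sketch (the function name is hypothetical):

```python
def pairwise_transfers(machines):
    """Sync-file transfers per round when every machine sends to every other one."""
    return machines * (machines - 1)


counts = {n: pairwise_transfers(n) for n in (2, 3, 5, 10)}
```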

Implementation Details

The core would consist of three components -

  1. Diff Service - This would be responsible for tracking changes and generating the diff file along with an identification file. It should be implemented as a Nepomuk Service (mostly implemented).
  2. Identifier - This is the system that accesses the ontology, and is responsible for taking a sync file and resolving identification issues.
  3. Merger - Responsible for merging two files once the identification process has been completed.

Diff Service

Please think of a better name. I hate this name! The basic outline has been implemented.

Identifier

(Rough brain dump)

System A = Machine that holds the metadata and has generated the sync file.

System B = Machine with which A's metadata is to be synced.

When identifying which resource from System A corresponds to which resource on System B, the possible cases are -

  1. The resource is found.
  2. The resource is not found.
    1. The file/contact/other physical representation exists on System B, but it has no metadata associated with it, and hence doesn't exist in the database.
    2. The file/contact/whatever doesn't actually exist in system B.

Differentiating between 2.1 and 2.2 is going to be hard.

Example - A file exists on System A and also on System B, but it has no metadata on either system. A new tag called "NewTag" is created on System A and assigned to the file. Then both systems are synced. The tag itself can be created on System B, because its nao:identifier tells us which tag it is, but how are we supposed to find the correct file on System B? Its location and filename could have changed, and even if they haven't, the database on System B has no record of the file. How do you differentiate between not being able to find the file and the file not existing on the other system at all?
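
One possible matching strategy, as a rough sketch only: a match on nxx:theIdentifyingProperty (nfo:hashValue in this draft) wins outright; otherwise resources of the same rdf:type are compared by counting matching identifying properties. The function name, the `min_matches` threshold and the single-valued dictionaries are simplifying assumptions, not a settled design.

```python
THE_IDENTIFYING_PROPERTY = "nfo:hashValue"  # nxx:theIdentifyingProperty in this draft


def identify(candidate, store_b, min_matches=2):
    """Return the URI on System B that matches `candidate`, or None.

    `candidate` maps property -> value for one resource from System A.
    `store_b` maps resource URI -> {property: value} for System B.
    """
    # 1. A match on the single decisive property wins outright.
    key = candidate.get(THE_IDENTIFYING_PROPERTY)
    if key is not None:
        for uri, props in store_b.items():
            if props.get(THE_IDENTIFYING_PROPERTY) == key:
                return uri

    # 2. Otherwise count matching properties among resources of the same type.
    wanted = {p: v for p, v in candidate.items() if p != "rdf:type"}
    required = min(min_matches, len(wanted))
    if required == 0:
        return None
    scores = {}
    for uri, props in store_b.items():
        if props.get("rdf:type") != candidate.get("rdf:type"):
            continue
        matches = sum(1 for p, v in wanted.items() if props.get(p) == v)
        if matches >= required:
            scores[uri] = matches
    if not scores:
        return None
    best = max(scores.values())
    winners = [uri for uri, score in scores.items() if score == best]
    # An ambiguous best score is treated as "not identified".
    return winners[0] if len(winners) == 1 else None


store_b = {
    "b:1": {"rdf:type": "nfo:FileDataObject", "nfo:fileName": "song.mp3",
            "nfo:hashValue": "abc123", "nfo:fileSize": "4096"},
    "b:2": {"rdf:type": "nao:Tag", "nao:identifier": "OldTag"},
}

by_hash = identify({"rdf:type": "nfo:FileDataObject", "nfo:hashValue": "abc123",
                    "nfo:fileName": "renamed.mp3"}, store_b)
by_props = identify({"rdf:type": "nfo:FileDataObject", "nfo:fileName": "song.mp3",
                     "nfo:fileSize": "4096"}, store_b)
missing = identify({"rdf:type": "nao:Tag", "nao:identifier": "NewTag"}, store_b)
```

Note that a `None` result covers both case 2.1 and case 2.2. The sketch only looks at System B's database, so it cannot tell them apart; that would need a look at the actual files or contacts on System B.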

Merger

The implementation shouldn't be too hard, as we have the timestamps. The only important detail is what happens when certain resources can't be merged: either the issue is presented to the user, or the best option is chosen (or both).
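
A minimal sketch of a timestamp-based merge, assuming each value is stored together with the time it was last changed and that the newer value simply wins. The data layout, the function name and the "keep the local value on a tie" rule are assumptions for illustration.

```python
def merge(local, remote):
    """Merge two {(resource, property): (value, timestamp)} maps.

    The newer value wins. Equal timestamps with different values are
    reported as conflicts, and the local value is kept for the time being.
    """
    merged = dict(local)
    conflicts = []
    for key, (r_value, r_time) in remote.items():
        if key not in merged:
            merged[key] = (r_value, r_time)
            continue
        l_value, l_time = merged[key]
        if r_value == l_value:
            continue
        if r_time > l_time:
            merged[key] = (r_value, r_time)
        elif r_time == l_time:
            conflicts.append((key, l_value, r_value))
    return merged, conflicts


# Timestamps are plain integers here to keep the example short.
local = {
    ("res:1", "nao:rating"): ("5", 100),
    ("res:1", "nao:description"): ("old", 100),
}
remote = {
    ("res:1", "nao:rating"): ("8", 200),
    ("res:1", "nao:description"): ("other", 100),
    ("res:2", "nao:identifier"): ("NewTag", 150),
}

merged, conflicts = merge(local, remote)
```

The `conflicts` list is what would be handed to the user (or to some "best option" heuristic) when a merge can't be decided automatically.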

GUI

The best case scenario would be to have the Core completely segregated from the GUI. Maybe they can communicate using D-Bus? The rationale behind this approach is that it should be easy for other users to integrate metadata backup/sync with their existing mechanisms for backup/sync.

The GUI should provide features like Automated Backup/sync, but these things should just be superfluous features that anyone can implement. The main work should be done in the Core.

GUIs will be required for -

  • Solving merge conflicts
  • Potentially identifying files (If the identifier fails to do so - Optional)
  • Configuring automated backup / sync - Integrate with existing Nepomuk KCM module.