Metadata Backup Sync Ontology

Possible names for the Ontology

Nepomuk Backup Sync Ontology (NBSO)
Nepomuk Backup Ontology (NBO)
Nepomuk Identification Ontology (NIO)

Ontology Level Markup of identifying properties

nxx:identifyingProperty a rdf:Property ;
    rdfs:comment "A combination of values of all identifying properties
are very likely to be unique for a resource, however each identifying
property value is not likely to be unique unlike nao:identifier".

nxx:theIdentifyingProperty a rdf:Property ;
    rdfs:subPropertyOf nxx:identifyingProperty ;
    nrl:maxCardinality 1 ;
    rdfs:comment "A property or combination that will identify the Resource. 
An example of this is nfo:hashValue" .

Other stuff which could be added -

nxx:identifyingProperty could be the base identifying property and it could have 3 subProperties.

nxx:theIdentifyingProperty
nxx:optionalIdentifyingProperties
nxx:mandatoryIdentifyingProperties

I think this would be quite useless, as almost every property could be labeled as optional. Plus too many options are confusing.

Different Ontologies

Just a rough draft of what properties should be derived from nxx:identifyingProperty -

NAO

nao:identifier

NIE

nie:url
nie:isPartOf
nie:isStoredAs ?
nie:mimeType ?

NFO

nfo:fileName
nfo:hashValue
nfo:fileSize

NID3

nid3:albumTitle
nid3:fileType
nid3:InvolvedPerson ?
nid3:publisher ?
nid3:length
nid3:track
nid3:trackNumber ?

Some of these details might not be provided by default. This is where to Web Metadata extractor comes into play. :)

Identification File

Along with the diff file, a identification file would be produced which would contain all the nxx:identifyingProperties and rdf:type of every resource in the diff file. These two files could then to be exported to machine with which the metadata is to be synced/backed up.

Possible Problems

The diff file can get quite large over time, and hence contain a large number of resources. This would result in the identification files becoming really large. The most obvious solution is to note the last sync date, and accordingly truncate the diff files.

Syncing with multiple machines - The current approach would require every machine to transfer its sync file (diff file + identification file) to every other machine. That coupled with the identification and merging process would become quite computationally expensive, specially with real time syncing. (If implemented)

Implementation Details

The core would consist of three components -

Diff Service - This would be responsible for tracking changing and generating the diff file along with an identification file. Should be implemented as a Nepomuk Service - Mostly Implemented.
Identifier - This is the system that accesses the ontology, and is responsible for taking a sync file as resolving identification issues.
Merger - Responsible for merging two files once the identification process has been completed.

Diff Service

Please think of a better name. I hate this name. The basic outline has been implemented.

Identifier

(Rough brain dump)

System A = Machine with metadata who has generated the Sync file.

System B = Machine with whom A's metadata is to be synced.

While identifying which Resource from Machine A corresponds to what Resource from machine B. Possible cases -

The resource is found.
The resource is not found.
1. The file/contact/physical representation on the file exists, but it has no metadata associated with it, and hence doesn't exist in the database.
2. The file/contact/whatever doesn't actually exist in system B.

Differentiating between 2.1 and 2.2 is going to be hard.

Example - A file exists on system A which also exists on System B, but the file doesn't have any metadata on both the systems. A new tag is created on System A called "NewTag". Then both the systems are synced. A new tag could be created on System B cause the nao:identifier would tell us about the tags, but how are we supposed to find the correct file on system B? Its location and filename could have potentially changed, and even if they haven't, the database would have no records of the file. How do you differentiate between when you can't seem to find the file, and when the file doesn't exist on the other system?

Merger

The implementation should be too hard as we have the timestamps. The only important detail would be when certain resources can't be merged. Then either the issue is presented to the user or the best option is chosen. (or both)