User:VHanda/Metadata-backupsync: Difference between revisions

    From KDE TechBase



    Revision as of 14:59, 25 May 2010

    Metadata Backup Sync Ontology

    Possible names for the Ontology

    • Nepomuk Backup Sync Ontology (NBSO)
    • Nepomuk Backup Ontology (NBO)
    • Nepomuk Identification Ontology (NIO)

    Or we could just couple it with NAO. An additional ontology isn't really required.

    Ontology Level Markup of identifying properties

    nxx:identifyingProperty a rdf:Property ;
        rdfs:comment """The combination of the values of all identifying properties
    is very likely to be unique for a resource. However, unlike nao:identifier,
    a single identifying property value is not likely to be unique.""" .

    nxx:theIdentifyingProperty a rdf:Property ;
        rdfs:subPropertyOf nxx:identifyingProperty ;
        nrl:maxCardinality 1 ;
        rdfs:comment """A property or combination that will identify the Resource.
    An example of this is nfo:hashValue.""" .
    

    Other stuff which could be added -

    nxx:identifyingProperty could be the base identifying property and it could have 3 subProperties -

    • nxx:theIdentifyingProperty
    • nxx:optionalIdentifyingProperties
    • nxx:mandatoryIdentifyingProperties

    I think this would be quite useless, as almost every property could be labeled as optional. Plus too many options are confusing.

    Different Ontologies

    Just a rough draft of what properties should be derived from nxx:identifyingProperty -

    RDFS

    • rdfs:label

    NAO

    • nao:identifier
    • nao:rating
    • nao:hasTag ?

    NIE

    • nie:url
    • nie:isPartOf
    • nie:isStoredAs ?
    • nie:mimeType ?

    NFO

    • nfo:fileName
    • nfo:hashValue -- nxx:theIdentifyingProperty
    • nfo:fileSize

    NMM

    • nmm:performer

    NCO

    • nco:url
    • nco:contactUID
    • nco:nickname
    • nco:imID
    • nco:fullname
    • nco:birthDate
    • nco:emailAddress

    Some of these details might not be provided by default. This is where the Web Metadata extractor comes into play. :)

    Identification File

    Along with the diff file, an identification file would be produced which would contain the rdf:type and the values of all identifying properties (sub-properties of nxx:identifyingProperty) of every resource in the diff file. These two files could then be exported to the machine with which the metadata is to be synced/backed up.
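
    A minimal sketch of how the identification file could be derived from the diff. The triple representation, the function name build_identification and the hard-coded property set are assumptions made for illustration; the real list of identifying properties would come from the ontology.

```python
# Hypothetical, hard-coded set. In practice this would be every property
# declared as a sub-property of nxx:identifyingProperty in the ontology.
IDENTIFYING_PROPERTIES = {
    "rdfs:label", "nao:identifier", "nie:url",
    "nfo:fileName", "nfo:hashValue", "nfo:fileSize",
    "nco:fullname", "nco:emailAddress",
}


def build_identification(diff_resources, store):
    """Collect rdf:type and identifying statements for resources in the diff.

    diff_resources: iterable of resource URIs mentioned in the diff file.
    store: iterable of (subject, predicate, object) triples of local metadata.
    """
    wanted = set(diff_resources)
    statements = [
        (s, p, o)
        for s, p, o in store
        if s in wanted and (p == "rdf:type" or p in IDENTIFYING_PROPERTIES)
    ]
    return sorted(statements)


store = [
    ("res:1", "rdf:type", "nfo:FileDataObject"),
    ("res:1", "nfo:fileName", "song.ogg"),
    ("res:1", "nao:description", "not an identifying property"),
    ("res:2", "rdf:type", "nao:Tag"),
    ("res:2", "nao:identifier", "NewTag"),
    ("res:3", "nfo:fileName", "untouched.txt"),  # not part of the diff
]
identification = build_identification(["res:1", "res:2"], store)
```

    Only statements about resources in the diff are kept, so the identification file grows with the diff and not with the whole database.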

    Possible Problems

    The diff file can get quite large over time, and hence contain a large number of resources. This would result in the identification files becoming really large. The most obvious solution is to note the last sync date, and accordingly truncate the diff files.
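
    The truncation idea could look roughly like this. The entry layout (a dict with a "timestamp" key) and the name truncate_diff are assumptions, not the actual diff format.

```python
from datetime import datetime


def truncate_diff(entries, last_sync):
    """Keep only the diff entries recorded after the last successful sync."""
    return [entry for entry in entries if entry["timestamp"] > last_sync]


entries = [
    {"timestamp": datetime(2010, 5, 1), "change": "added nao:hasTag"},
    {"timestamp": datetime(2010, 5, 20), "change": "changed nao:rating"},
]
recent = truncate_diff(entries, last_sync=datetime(2010, 5, 10))
```

    When syncing with several machines, the cutoff would presumably have to be the oldest last-sync date among all of them, otherwise a machine that syncs rarely would miss changes.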

    Syncing with multiple machines - The current approach would require every machine to transfer its sync file (diff file + identification file) to every other machine, which means n(n-1) transfers for n machines. Coupled with the identification and merging process, this would become quite computationally expensive, especially with real-time syncing (if that is implemented).


    Redundant tuples - Currently Strigi doesn't reuse existing info. E.g., when creating metadata for multimedia files, a new contact is created for every nmm:performer, even if the performer has the same name. So we can't rely on certain properties like nmm:performer. Alternatively, we could have rules which compare the nco:fullname whenever the resources are contacts. But contacts are just one case, and our solution should be generic. Ideally, it would be amazing if someone could fix Strigi.
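
    One generic way to spot such redundant resources, sketched under the assumption that each resource is available as a dict of single-valued properties: group resources by their rdf:type plus the values of all their identifying properties, and treat any group with more than one member as duplicates. The function name and data layout are made up for this example.

```python
from collections import defaultdict


def find_duplicates(resources, identifying):
    """Group resources that share rdf:type and all identifying values.

    resources: dict mapping URI -> dict of property -> value.
    identifying: set of property names treated as identifying.
    """
    groups = defaultdict(list)
    for uri, props in resources.items():
        key_props = tuple(sorted(
            (p, v) for p, v in props.items() if p in identifying
        ))
        if not key_props:
            continue  # nothing to identify this resource by
        groups[(props.get("rdf:type"), key_props)].append(uri)
    return [sorted(uris) for uris in groups.values() if len(uris) > 1]


resources = {
    "res:a": {"rdf:type": "nco:Contact", "nco:fullname": "Some Performer"},
    "res:b": {"rdf:type": "nco:Contact", "nco:fullname": "Some Performer"},
    "res:c": {"rdf:type": "nco:Contact", "nco:fullname": "Another Performer"},
}
duplicates = find_duplicates(resources, {"nco:fullname"})
```

    Nothing here is specific to contacts; the same grouping works for any type, as long as the ontology marks its identifying properties.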


    Implementation Details

    The core would consist of three components -

    1. Diff Service - This would be responsible for tracking changes and generating the diff file along with an identification file. It should be implemented as a Nepomuk Service. Status: mostly implemented.
    2. Identifier - This is the system that accesses the ontology, and is responsible for taking a sync file and resolving identification issues.
    3. Merger - Responsible for merging two files once the identification process has been completed.

    Diff Service

    Please think of a better name. I hate this name! The basic outline has been implemented.

    Identifier

    (Rough brain dump)

    System A = The machine that holds the metadata and has generated the sync file.

    System B = The machine with which A's metadata is to be synced.

    When identifying which resource from System A corresponds to which resource on System B, the possible cases are -

    1. The resource is found.
    2. The resource is not found.
      1. The file/contact/physical representation of the resource exists on System B, but it has no metadata associated with it, and hence doesn't exist in the database.
      2. The file/contact/whatever doesn't actually exist in system B.

    Differentiating between 2.1 and 2.2 is going to be hard.

    Example - A file exists on System A which also exists on System B, but the file doesn't have any metadata on either system. A new tag is created on System A called "NewTag". Then both the systems are synced. A new tag could be created on System B because the nao:identifier would tell us about the tag, but how are we supposed to find the correct file on System B? Its location and filename could have potentially changed, and even if they haven't, the database would have no records of the file. How do you differentiate between when you can't seem to find the file, and when the file doesn't exist on the other system?
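
    A rough sketch of what the lookup on System B could look like, assuming resources are dicts of single-valued properties. The function name identify and the matching policy (hash first, then agreement on every identifying property the remote resource carries) are assumptions, not a settled design.

```python
THE_IDENTIFYING = "nfo:hashValue"  # marked as nxx:theIdentifyingProperty


def identify(remote, local_resources, identifying):
    """Return the URI of the local resource matching `remote`, or None."""
    # A shared hash settles the question on its own.
    remote_hash = remote.get(THE_IDENTIFYING)
    if remote_hash is not None:
        for uri, props in local_resources.items():
            if props.get(THE_IDENTIFYING) == remote_hash:
                return uri

    # Otherwise require the same type and agreement on every other
    # identifying property that the remote resource carries.
    wanted = {
        p: v for p, v in remote.items()
        if p in identifying and p != THE_IDENTIFYING
    }
    if not wanted:
        return None
    candidates = [
        uri for uri, props in local_resources.items()
        if props.get("rdf:type") == remote.get("rdf:type")
        and all(props.get(p) == v for p, v in wanted.items())
    ]
    # An ambiguous match is treated like no match at all.
    return candidates[0] if len(candidates) == 1 else None


identifying = {"nao:identifier", "nfo:fileName", "nfo:hashValue"}
system_b = {"b:tag1": {"rdf:type": "nao:Tag", "nao:identifier": "NewTag"}}

tag = {"rdf:type": "nao:Tag", "nao:identifier": "NewTag"}
untracked_file = {"rdf:type": "nfo:FileDataObject", "nfo:fileName": "a.txt"}

tag_match = identify(tag, system_b, identifying)
file_match = identify(untracked_file, system_b, identifying)
```

    The tag is found, but the file comes back as None, and that None says nothing about whether the file is missing from System B or merely has no metadata there. Cases 2.1 and 2.2 stay indistinguishable at this level.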

    Special Case : Strigi isn't enabled! Then most of the metadata won't exist, and the identifier would be useless.

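    Home Directory Change - If the home directory differs between the two systems, the nie:url of the same file will never be the same on both, unless the file is saved in the root directory (that is, outside the home directory). We'll need to work around this.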
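
    One possible workaround, sketched with only the Python standard library: compare file URLs relative to each system's home directory instead of comparing them verbatim. The function name home_relative and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def home_relative(url, home):
    """Return the path of a file:// URL relative to `home`.

    Returns None for non-file URLs and for files outside `home`.
    """
    parsed = urlparse(url)
    if parsed.scheme != "file":
        return None
    path = PurePosixPath(unquote(parsed.path))
    try:
        return str(path.relative_to(PurePosixPath(home)))
    except ValueError:  # the file does not live under `home`
        return None


on_a = home_relative("file:///home/alice/Music/a.ogg", "/home/alice")
on_b = home_relative("file:///home/bob/Music/a.ogg", "/home/bob")
```

    Both calls give "Music/a.ogg", so the two URLs can be treated as the same location even though the nie:url values differ.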

    Merger

    The implementation shouldn't be too hard as we have the timestamps. The only important detail is what happens when certain resources can't be merged: either the issue is presented to the user, or the best option is chosen (or both).
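
    A minimal sketch of a timestamp-based merge, assuming single-valued properties and that every statement carries the time of its last change. The data layout and the "newest change wins" policy are assumptions; statements that differ but carry the same timestamp are reported as conflicts instead of being resolved.

```python
def merge(local, remote):
    """Merge two statement maps; the newer value wins.

    local, remote: dict mapping (subject, predicate) -> (value, timestamp).
    Returns (merged, conflicts), where conflicts lists the keys that
    could not be decided by timestamp alone.
    """
    merged = dict(local)
    conflicts = []
    for key, (remote_value, remote_ts) in remote.items():
        if key not in merged:
            merged[key] = (remote_value, remote_ts)
            continue
        local_value, local_ts = merged[key]
        if remote_value == local_value:
            continue
        if remote_ts > local_ts:
            merged[key] = (remote_value, remote_ts)
        elif remote_ts == local_ts:
            conflicts.append(key)  # ask the user or apply a policy
    return merged, conflicts


local = {
    ("res:1", "nao:rating"): ("6", 100),
    ("res:1", "rdfs:label"): ("Old label", 100),
}
remote = {
    ("res:1", "nao:rating"): ("8", 200),
    ("res:1", "rdfs:label"): ("New label", 100),
    ("res:2", "nao:identifier"): ("NewTag", 150),
}
merged, conflicts = merge(local, remote)
```

    The conflicts list is what would be handed to the user (or to an automatic policy) when a resource can't be merged cleanly.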

    GUI

    The best case scenario would be to have the Core completely segregated from the GUI. Maybe they can communicate using DBus? The rationale behind this approach is that it should be easy for other users to integrate metadata backup/sync with their existing mechanisms for backup/sync.

    The GUI should provide features like Automated Backup/sync, but these things should just be superfluous features that anyone can implement. The main work should be done in the Core.

    GUIs will be required for -

    • Solving merge conflicts
    • Potentially identifying files (If the identifier fails to do so - Optional)
    • Configuring automated backup / sync - Integrate with existing Nepomuk KCM module.

    Strigi

    Currently Strigi doesn't calculate the hash/checksum, because it's a slow, time-consuming process. We need some kind of "slow indexer" which calculates all the hashes.
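
    What the hashing step of such a "slow indexer" might look like, as a sketch only: the file is read in chunks, with an optional pause between chunks so the indexer doesn't hog the disk. SHA-1, the chunk size and the function name slow_hash are assumptions; this is not Strigi code.

```python
import hashlib
import time


def slow_hash(path, chunk_size=64 * 1024, pause=0.0):
    """Return the SHA-1 hex digest of a file, read chunk by chunk.

    `pause` is the number of seconds to sleep after each chunk, which
    trades indexing speed for lower I/O pressure.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            if pause:
                time.sleep(pause)
    return digest.hexdigest()
```

    The resulting digest would be stored as the nfo:hashValue of the file, giving the identifier its strongest identifying property.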