To understand why those inconsistencies could occur, you have to recall the nature of file access in distributed systems like XtreemFS where metadata and file data is separated.
Opening a file results in a call to the Metadata and Replica Catalog (MRC) which returns the set of replicas and a capability. File data access, like reading or writing, is done directly among the client and the Object Storage Devices (OSD) that are listed in the replica set. The MRC is no longer involved, since access is granted as long as the capability is valid.
Data access in XtreemFS |
Consider for example the case that a file has five replicas called A, B, C, D and E. Then a valid majority is A, B and C, even if D and E are not online. This could happen for example if a link between some regions is highly unstable. If the replicas A, B and C are to be removed now, it has to be ensured both, that no client is allowed to write anymore data to A, B and C based on the old replica set, and that data previously not replicated to D and E is transferred to them prior to the installation of the new replica set consisting of just D and E.
The protocol that has been introduced with XtreemFS 1.5.1 does just that. It is extending the replica set to contain a version number, denies access based on outdated versions and uses the MRC to coordinate changes to the replica set between the involved replicas.
The coordination is central to the protocol and involves three stages. First a majority of the old replicas is getting invalidated, to ensure the data does not change during the second stage. The second stage ensures that the latest file data is transferred and updated on majority of the new replicas. Third and lastly the new replica set is installed with an incremented version number.
After the installation the new replica set is returned to clients opening the file. The installation on the replicas happens implicitly if a replica set with a higher version number is encountered.
Then again, if a client tries to access a replica with an outdated replica set is is denied. In XtreemFS 1.5.1 both the libxtreemfs for Java and C++ are handling errors based on outdated replica sets transparently by reloading the replica set from the MRC and retrying the request.
The new feature allows easily adding and removing of replicas for users and guarantees data consistency. Although it is compatible with clients build from the previous versions, it is recommend to update clients and servers simultaneously to profit from the transparent reloading of outdated replica sets.