Tuesday, March 24, 2015

Consistency while Adding and Removing Replicas

As mentioned in the release notes for XtreemFS 1.5.1 the adding and removing of replicas got more robust. Previously there had been border cases where access based on an outdated replica set could result in data inconsistencies. With XtreemFS 1.5.1 a protocol has been established that ensures consistency in any case.

To understand why those inconsistencies could occur, you have to recall the nature of file access in distributed systems like XtreemFS where metadata and file data is separated.
Opening a file results in a call to the Metadata and Replica Catalog (MRC) which returns the set of replicas and a capability. File data access, like reading or writing, is done directly among the client and the Object Storage Devices (OSD) that are listed in the replica set. The MRC is no longer involved, since access is granted as long as the capability is valid.

Data access in XtreemFS
It is apparent that without further action, different clients could obtain different replica sets for the same file, in case replicas have been added or removed in-between. As the quorum required by the R/W replication is also established by the replicas listed in the replica set, it is possible that different clients access data on a non intersecting sub sets of the replicas.

Consider for example the case that a file has five replicas called A, B, C, D and E. Then a valid majority is A, B and C, even if D and E are not online. This could happen for example if a link between some regions is highly unstable. If the replicas A, B and C are to be removed now, it has to be ensured both, that no client is allowed to write anymore data to A, B and C based on the old replica set, and that data previously not replicated to D and E is transferred to them prior to the installation of the new replica set consisting of just D and E.

The protocol that has been introduced with XtreemFS 1.5.1 does just that. It is extending the replica set to contain a version number, denies access based on outdated versions and uses the MRC to coordinate changes to the replica set between the involved replicas.

The coordination is central to the protocol and involves three stages. First a majority of the old replicas is getting invalidated, to ensure the data does not change during the second stage. The second stage ensures that the latest file data is transferred and updated on majority of the new replicas. Third and lastly the new replica set is installed with an incremented version number.

After the installation the new replica set is returned to clients opening the file. The installation on the replicas happens implicitly if a replica set with a higher version number is encountered.
Then again, if a client tries to access a replica with an outdated replica set is is denied. In XtreemFS 1.5.1 both the libxtreemfs for Java and C++ are handling errors based on outdated replica sets transparently by reloading the replica set from the MRC and retrying the request.

The new feature allows easily adding and removing of replicas for users and guarantees data consistency. Although it is compatible with clients build from the previous versions, it is recommend to update clients and servers simultaneously to profit from the transparent reloading of outdated replica sets.


2 comments:

Unknown said...
This comment has been removed by a blog administrator.
Unknown said...
This comment has been removed by a blog administrator.