Tuesday, March 24, 2015

Consistency while Adding and Removing Replicas

As mentioned in the release notes for XtreemFS 1.5.1 the adding and removing of replicas got more robust. Previously there had been border cases where access based on an outdated replica set could result in data inconsistencies. With XtreemFS 1.5.1 a protocol has been established that ensures consistency in any case.

To understand why those inconsistencies could occur, you have to recall the nature of file access in distributed systems like XtreemFS where metadata and file data is separated.
Opening a file results in a call to the Metadata and Replica Catalog (MRC) which returns the set of replicas and a capability. File data access, like reading or writing, is done directly among the client and the Object Storage Devices (OSD) that are listed in the replica set. The MRC is no longer involved, since access is granted as long as the capability is valid.

Data access in XtreemFS
It is apparent that without further action, different clients could obtain different replica sets for the same file, in case replicas have been added or removed in-between. As the quorum required by the R/W replication is also established by the replicas listed in the replica set, it is possible that different clients access data on a non intersecting sub sets of the replicas.

Consider for example the case that a file has five replicas called A, B, C, D and E. Then a valid majority is A, B and C, even if D and E are not online. This could happen for example if a link between some regions is highly unstable. If the replicas A, B and C are to be removed now, it has to be ensured both, that no client is allowed to write anymore data to A, B and C based on the old replica set, and that data previously not replicated to D and E is transferred to them prior to the installation of the new replica set consisting of just D and E.

The protocol that has been introduced with XtreemFS 1.5.1 does just that. It is extending the replica set to contain a version number, denies access based on outdated versions and uses the MRC to coordinate changes to the replica set between the involved replicas.

The coordination is central to the protocol and involves three stages. First a majority of the old replicas is getting invalidated, to ensure the data does not change during the second stage. The second stage ensures that the latest file data is transferred and updated on majority of the new replicas. Third and lastly the new replica set is installed with an incremented version number.

After the installation the new replica set is returned to clients opening the file. The installation on the replicas happens implicitly if a replica set with a higher version number is encountered.
Then again, if a client tries to access a replica with an outdated replica set is is denied. In XtreemFS 1.5.1 both the libxtreemfs for Java and C++ are handling errors based on outdated replica sets transparently by reloading the replica set from the MRC and retrying the request.

The new feature allows easily adding and removing of replicas for users and guarantees data consistency. Although it is compatible with clients build from the previous versions, it is recommend to update clients and servers simultaneously to profit from the transparent reloading of outdated replica sets.


Thursday, March 12, 2015

XtreemFS 1.5.1 Released

A new stable release of the distributed file system XtreemFS is available. XtreemFS 1.5.1 comes with the following major features:
  • Improved Hadoop support: The Hadoop Adapter supports Hadoop-2.x and other applications running on the YARN platform.
  • Consistent adding and removing replicas for R/W replication: Replica consistency is ensured while adding and removing replicas, xtfs_scrub can replace failed replicas automatically.
  • Improved SSL mode: The used SSL/TLS version is selectable, strict certificate chain checks are possible, the SSL code on client and server side was improved.
  • Better support for mounting XtreemFS using /etc/fstab: All mount parameters can be passed to the client by mount.xtreemfs -o option=value.
  • Initial version of an LD_PRELOAD based client: The client comes in the form of a library that can be linked to an application via LD_PRELOAD. File system calls to XtreemFS are directly forwarded to the services without FUSE. The client is intended for systems without FUSE or performance critical applications (experimental).
  • The size of a volume can be limited: Added quota support on volume level. The capacity limits are currently checked while opening a file on the MRC.
  • OSD health monitoring: OSDs can report their health, e.g. determined by SMART values, to the DIR. The results are aggregated in the DIR web interface. The default OSD selection policy can skip unhealthy OSDs.
  • Minor bugfixes and improvements across all components: See the CHANGELOG for more details and references to the issue numbers.
Furthermore we provide Dockerfiles to run the XtreemFS services in containers. The Dockerfiles are available in a separate Git repository at https://github.com/xtreemfs/xtreemfs-docker.

To ease contributing to XtreemFS for new developers, we added a Vagrantfile to the XtreemFS Git repository that allows setting up a virtual machine having all dependencies to build XtreemFS automatically.

The development of this release was partially funded by the European Commission in the HARNESS project under Grant Agreement No. 318521, as well as the German projects FFMK, GeoMultiSens and BBDC.