Friday, May 7, 2010

How XtreemFS uses leases for replication

If you implement a distributed file system like XtreemFS, you are dealing with many interesting problems. The most central one is probably making files behave as if they were stored in a local file system.

The main property of this sought-after behavior is called strong or sequential consistency [*]. Sequential consistency requires that reads and writes (even concurrent ones) are executed in a well-defined (but arbitrary) order. Apart from sequential consistency, the file system must also ensure that reads reflect all previous writes and that concurrent reads and writes are isolated.

As XtreemFS supports fault-tolerant file replication, it has to maintain the local behavior of files while internally storing and coordinating multiple physically independent copies of the data. In technical terms, this translates to implementing sequential consistency for replicated data in a fault-tolerant way.

The simplest way to implement sequential consistency is to use a central instance that defines the order in which operations change the data (a sequencer). And indeed, many distributed file systems realize sequential consistency by establishing a lock server that hands out locks to clients or storage servers. The lock holder receives all operations on the data and executes them serially. The result is a well-defined order. These locks can be made fault-tolerant by attaching a timeout to them: what you get is a lease, which expires automatically even when the holder is unresponsive or dead.
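The lock-with-timeout idea can be made concrete in a few lines. This is a minimal, hypothetical sketch (the class and method names are mine, not XtreemFS code): a lease grants exclusive ownership for a fixed period, the holder must renew it before expiry, and if the holder dies the lease simply runs out, so no explicit revocation is needed.

```python
import time

class Lease:
    """A lock with a timeout: ownership expires unless renewed."""

    def __init__(self, holder, timeout=30.0):
        self.holder = holder
        self.timeout = timeout
        self.expires = time.monotonic() + timeout

    def is_valid(self):
        # The lease is only in force until its expiry time.
        return time.monotonic() < self.expires

    def renew(self, holder):
        # A different party may only take over once the lease has expired;
        # the current holder can renew at any time.
        if self.is_valid() and holder != self.holder:
            return False
        self.holder = holder
        self.expires = time.monotonic() + self.timeout
        return True
```

Note that correctness hinges on clocks: the grantor must not hand the lease to a new holder before the old holder has observed the expiry, which is why real systems pad the timeout with a clock-drift margin.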

A sophisticated alternative to defining a sequencer is to use a so-called replicated state machine, a distributed algorithm that is executed by all replicas of the data. If you want to implement a fault-tolerant version of it, you will end up using a Paxos derivative. The problem is that all fault-tolerant algorithms in this domain require two network round-trips for both reads and writes to establish sequential consistency across all replicas, which is clearly too expensive for high-rate operations like those on files.

So we are left with the central-instance approach, fully aware that this introduces both a single point of failure and a performance bottleneck. A quick back-of-the-envelope calculation reveals the load: assuming 50,000 open files in your cluster and a 30-second lease timeout, you get about 1,666 lease renewals per second, which is already quite some load for a lock server.
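The back-of-the-envelope calculation above, spelled out:

```python
# Each open file holds a lease that must be renewed once per timeout period,
# so the lock server sees open_files / timeout renewal requests per second.
open_files = 50_000
lease_timeout_s = 30
renewals_per_sec = open_files / lease_timeout_s
print(renewals_per_sec)  # ~1666 renewals/sec hitting the lock server
```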

Such a high lease renewal rate is even more of a problem when you consider fault-tolerance of the lock server itself. To ensure that it is highly available, you need to replicate its state, and you are again faced with a sequential consistency + replication problem. The solutions: master-slave replication with some fail-over protocol (another lease problem?) or a replicated state machine for the lease server itself. Google chose the latter for its Chubby lock service. Replicating the lock server's state solves the availability problem, but worsens the performance bottleneck: Google's "Paxos Made Live" paper cites 640 ops/sec [3]. Not enough for 50k open files (although Chubby is not used for GFS's leases).

For XtreemFS, we have chosen a different approach. Instead of using a lock service to manage leases, we let the object storage devices (OSDs) negotiate leases among themselves. For each file, all OSDs that host a replica of the file negotiate the lease. The lease-holding OSD acts as a sequencer and receives and executes all reads and writes. In turn, an OSD participates in all lease negotiations for all its file replicas that are currently open.
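The sequencer role described above can be sketched as follows. This is an illustrative toy, not XtreemFS code (the class, the in-memory lease table, and the forwarding scheme are my assumptions): whichever OSD holds the lease for a file applies all operations serially, and replicas that do not hold the lease forward requests to the holder.

```python
class OSD:
    """Toy object storage device: executes writes only if it holds the lease."""

    def __init__(self, name):
        self.name = name
        self.data = {}    # file_id -> contents
        self.leases = {}  # file_id -> name of the lease-holding OSD

    def handle_write(self, file_id, payload, cluster):
        holder = self.leases[file_id]
        if holder != self.name:
            # Not the sequencer for this file: forward to the lease holder.
            return cluster[holder].handle_write(file_id, payload, cluster)
        # Lease holder: apply operations in arrival order, which yields
        # the well-defined order that sequential consistency requires.
        self.data[file_id] = payload
        return "ok"
```

In the real system the holder would also propagate the update to the other replicas before acknowledging; that part is deliberately left out here, as the post defers it to a follow-up.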

Negotiating leases directly between the storage servers has several advantages: it scales naturally with the number of storage servers, it saved us from implementing a lock server, and it spares the user the headache of provisioning and managing another service. The only problem we needed to overcome was that we first needed an algorithm that negotiates leases in a fault-tolerant, decentralized way. Such a thing didn't exist, and judging from a recent blog post by Jeff Darcy, the use of fault-tolerant leases still seems to be in its infancy [Link].

The result of our efforts are two algorithms, FaTLease [1] and its successor Flease [2]. They scale to thousands of concurrent lease negotiations per second - for each set of participants. For XtreemFS this essentially means that the number of open files counts per storage server rather than against the whole file system. At thousands of negotiations per second, this translates to an open file count of more than 50k files per OSD.

With a fault-tolerant lease negotiation algorithm, we have solved the problem of enforcing sequential consistency and arbitrating concurrent operations. While this is the hardest part of implementing replication, the data replicas also need to be updated for every change. How this is done in XtreemFS will be a topic of a future blog post.


[*] POSIX actually mandates an even stronger consistency model: serializability. In simple terms, it means that the file system has to take communication between its clients into account. However, this is impractical for distributed file systems, as the file system would have to control all communication channels of its clients.

References:
[1] F. Hupfeld, B. Kolbeck, J. Stender, M. Högqvist, T. Cortes, J. Malo, J. Marti. “FaTLease: Scalable Fault-Tolerant Lease Negotiation with Paxos”. In: Cluster Computing 2009.

[2] B. Kolbeck, M. Högqvist, J. Stender, F. Hupfeld. “Fault-Tolerant and Decentralized Lease Coordination in Distributed Systems”. Technical Report 10-02, Zuse Institute Berlin, 2010.

[3] T. Chandra, R. Griesemer, J. Redstone. "Paxos Made Live: An Engineering Perspective". In: PODC '07: 26th ACM Symposium on Principles of Distributed Computing.
