XtreemFS: 2010

Monday, August 2, 2010

Want to work on XtreemFS?

We have positions for PhD students at the Zuse Institute Berlin (Germany) where you have the opportunity to work on XtreemFS within the CONTRAIL project. You can email (kolbeck

zib.de) or call (+49-30-841-85-328) us for more information. Deadline is October 15th.

Here is the official job description:

The Zuse Institute Berlin (ZIB) is a non-university research institute under public law of the state of Berlin. In close interdisciplinary co-operation with the Berlin universities as well as national and international scientific institutions, ZIB conducts research and development in the field of information technology, applied mathematics, and computer science. To support research and development efforts in EU- and BMBF-funded projects, the department Parallel and Distributed Systems invites applications for several PhD Student or PostDoc Positions (f/m) for the duration of two years - Vgr. IIa/Ib BAT/Anwendungs-TV Land Berlin - Application code WA 22/10 As a research assistant you will explore, design, implement and evaluate scalable, fault-tolerant and distributed algorithms and systems for processing large-scale scientific data. We have developed a range of systems including: Scalaris, a structured peer-to-peer storage system; XtreemFS, a distributed and replicated file system, and BabuDB, a replicated key-value store. In co-operation with partners from science and industry we validate, extend, and optimize our solutions in production environments.

Requirements

Master's degree or Diploma in computer science
Solid fundamentals in distributed systems and algorithms
Experience with distributed file systems, databases or peer-to-peer technology
Demonstrated coding skills in C++, Java or Erlang
Familiarity with Unix/Linux
Ability to work in interdisciplinary and international teams
Fluency in English

You will work in an inspiring and pleasant environment and will receive adequate professional support. We offer challenging scientific tasks, a high degree of autonomy, and state-of-the-art technical infrastructure. You will have the opportunity to pursue a PhD or Habilitation supervised by Prof. Reinefeld. The position will be initially financed for a period of two years with the possibility of extension. The salary is based upon wage group IIa/Ib as per Berlin Collective Agreement for the Public Sector. Zuse Institute Berlin is an equal opportunity employer. We prefer to balance the number of female and male employees in our institute. Thus, we kindly encourage female candidates to apply to this job offer. Handicapped persons will be given preference to other equally qualified candidates. Please send your complete application, referring to application code WA 22/10, including cover letter, CV and relevant certificates/GPA/university transcripts until 15. October 2010 to Konrad-Zuse-Zentrum fuer Informationstechnik Berlin (ZIB), - Verwaltung -, Takustr. 7, 14195 Berlin, Germany

Monday, June 7, 2010

XtreemFS at LinuxTag 2010

We'll be at LinuxTag 2010 in Berlin which starts in two days (June 9th to 12th). Visit us at the XtreemOS booth #206 in Halle 7.2a.

Wednesday, May 26, 2010

ISC, Summer School ...

You can meet us at ISC 2010 in Hamburg from 31/05 to 02/06 at the XtreemOS booth (booth #121, next to Unicore and BSC).

The XtreemOS summer school will take place at Schloss Günzburg near Ulm from the 5th to the 9th of July. There is also a talk and a practical on XtreemFS.

Finally, the XtreemOS Challenge offers a prize of €1,000 for the best application ported to XtreemOS.

Friday, May 7, 2010

How XtreemFS uses leases for replication

If you implement a distributed file systems like XtreemFS, you are dealing with many interesting problems. The most central one is probably to make files behave as if they were stored in the local file system.

The main property of this sought behavior is called strong or sequential consistency [*]. Sequential consistency requires that reads and writes (even concurrent ones) are executed in a well-defined (but random) order. Apart from sequential consistency, the file system must also ensure that reads reflect all previous writes and that concurrent reads and writes are isolated.

As XtreemFS supports fault-tolerant file replication, it has to maintain the local behavior of files while internally storing and coordinating multiple physically independent copies of the data. In technical terms, this translates to implementing sequential consistency for replicated data in a fault-tolerant way.

The simplest way to implement sequential consistency is to use a central instance that defines the order in which operations change the data (a sequencer). And indeed many distributed file systems realize sequential consistency by establishing a lock server that hands out locks to clients or storage servers. The lock holder receives all operations on the data and executes them serially. The result is a well-defined order. These locks can be made fault-tolerant by attaching a timeout to them: what you get is a lease, which can be revoked even when a client is unresponsive or dead.

A sophisticated alternative to defining a sequencer is to use a so-called replicated state machine, a distributed algorithm that is executed by all replicas of the data. If you want implement a fault-tolerant version of it, you will end up with using a Paxos derivative. The problem is that all fault-tolerant algorithms in this domain require two round-trips for both reads and writes to the data to establish sequential consistency across all replicas, which is clearly to expensive for high-rate operations like the ones on files.

So we are left with the central instance approach, fully aware that this introduces both a single point of failure and performance bottleneck. A quick back-of-the-envelope calculation reveals: assuming 50.000 open files in your cluster, with a 30 sec timeout you have 1666/sec lease renewals, which already quite some load for a lock server.

Such a high lease renewal rate is even more of a problem when you consider fault-tolerance of the lock server itself. To ensure that it is highly available, you need to replicate its state, and are again faced with a sequential consistency + replication problem. The solutions: master-slave replication with some fail-over protocol (another lease problem?) or a replicated state machine for the lease server itself. The latter has been chosen by Google for their Chubby lock service. The replication of the lock servers state solves the availability problem, but worsens the performance bottleneck. Google's "Paxos Made Live" paper cites 640 ops/sec [3]. Not enough for 50k open files (although Chubby is not used for GFS' leases).

For XtreemFS, we have chosen a different approach. Instead of using a lock service to manage leases, we let the object storage devices (OSDs) negotiate leases among themselves. For each file, all OSDs that host a replica of the file negotiate the lease. The lease-holding OSD acts as a sequencer and receives and executes all reads and writes. In turn, an OSD participates in all lease negotiations for all its file replicas that are currently open.

Negotiating leases directly by the storage servers has several advantages: it scales naturally with the number of storage servers, it saved us from implementing a lock server, and the user from the headache of provisioning and managing another service. The only problem we needed to overcome was that we first needed an algorithm that negotiates leases in a fault-tolerant, decentralized way. Such a thing didn't exist, and telling from a recent blog post from Jeff Darcy, the usage of fault-tolerant leases still seams to be its infancy [Link].

The result of our efforts are two algorithms, FatLease [1] and its successor Flease [2]. They scale to thousands of concurrent lease negotations per second - for each set of participants. For XtreemFS this means essentially that the number of open files is counted per storage server and not against the whole file system. With 1000s/sec. negotiations, this would translate to an open file count of more than 50k files per OSD.

With a fault-tolerant lease negotiation algorithm, we have solved the problem of enforcing sequential consistency and arbitrating concurrent operations. While this is the hardest part of implementing replication, the data replicas also need to be updated for every change. How this is done in XtreemFS will be a topic of a future blog post.

[*] POSIX actually mandates a strong consistency model: serializabiliy. In simple terms, it means that the file system has to take communication between its clients into account. However, this is impractical for distributed file systems, as the file system would have to control all communication channels of its clients.

References:

[1] F. Hupfeld, B. Kolbeck, J. Stender, M. Högqvist, T. Cortes, J. Malo, J. Marti. “FaTLease: Scalable Fault-Tolerant Lease Negotiation with Paxos”. In: Cluster Computing 2009.

[2] B. Kolbeck, M. Högqvist, J. Stender, F. Hupfeld. “Fault-Tolerant and Decentralized Lease Coordination in Distributed Systems”. Technical Report 10-02, Zuse Institute Berlin, 2010.

[3] Tushar Chandra, Robert Griesemer, and Joshua Redstone. "Paxos made live". PODC '07: 26th ACM Symposium on Principles of Distributed Computing.

Tuesday, April 6, 2010

student position in Berlin

We have a student position for XtreemFS at ZIB in Berlin: http://www.zib.de/News/Jobs/index.en.html (in German)

Friday, March 5, 2010

XtreemFS user survey

You can help us make XtreemFS better. Let us know what you use XtreemFS for and which features you need. Fill out our user survey at http://www.xtreemfs.org/user_survey.php

Thursday, February 4, 2010

XtreemFS update 1.2.1

We just released an update for XtreemFS (version 1.2.1). This version contains mainly bug fixes, e.g. for FreeBSD and Fedora 12, and enhanced replica management. The new scrubber will automatically replace failed replicas.

Source code and packages are available for download on http://www.xtreemfs.org.

There is no change in the database or OSD storage, so an update from 1.2 should work out of the box. The usual advice: backup important data before upgrading.