[tahoe-dev] How Tahoe-LAFS fails to scale up and how to fix it (Re: Starvation amidst plenty)

Sat Sep 25 04:07:52 UTC 2010

Something I've been wondering: how is reliability affected by sending
duplicate shares to different servers? For example, if we encode with
a 3/7/10 code, but we have 20 servers available, and we send each
share to two servers, is the reliability significantly worse than
using a 3/??/20 code? Using such a scheme would be simpler than
dynamically adjusting the encoding used to the number of servers, and
it wouldn't have the 256 share limitation that zfec has.

Another (possibly even sillier) question: Is there a performance
reason not to generate as many shares as possible, and only upload as
many unique shares as we can to different hosts? This would be a
completely different allocation strategy than what Tahoe uses now, but
it might be more reliable. It'd also use as much space as possible,
though, and the space usage wouldn't be very predictable, so actually
upon reflection this isn't that great an idea. Still worth mentioning
though, I think.

--Ravi

On Fri, Sep 24, 2010 at 2:36 AM, Zooko O'Whielacronx <zooko at zooko.com> wrote:
> On Mon, Sep 20, 2010 at 10:14 AM, Shawn Willden <shawn at willden.org> wrote:
>>
>> But Tahoe isn't really optimized for large grids.  I'm not
>> sure how big the grid has to get before the overhead of all of the
>> additional queries to place/find shares begin to cause significant
>> slowdowns, but based on Zooko's reluctance to invite a lot more people
>> into the volunteer grid (at least, I've perceived such a reluctance),
>> I suspect that he doesn't want too many more than the couple of dozen
>> nodes we have now.
>
> The largest Tahoe-LAFS grid that has ever existed as far as I know was
> the allmydata.com grid. It had about 200 storage servers, where each
> one was a userspace process which had exclusive access to one spinning
> disk. Each disk was 1.0 TB except a few were 1.5 TB, so the total raw
> capacity of the grid was a little more than 200 TB. It used K=3, M=10
> erasure coding so the cooked capacity was around 70 GB.
>
> Most of the machines were 1U servers with four disks, a few were 2U
> servers with six, eight, or twelve disks, and one was a Sun Thumper
> with thirty-six disks. (Sun *gave* us that Thumper, which was only
> slightly out of date at the time, just because we were a cool open
> source storage startup. Those were the days.)
>
> Based on that experience, I'm sure that Tahoe-LAFS scales up to at
> least 200 nodes in the performance of queries to place/find shares.
> (Actually, those queries have been significantly optimized since then
> by Kevan and Brian for both upload and download of mutable files, so
> modern Tahoe-LAFS should perform even better in that setting.)
>
> However, I'm also sure that Tahoe-LAFS *failed* to scale up in a
> different way, and that other failure is why I jealously guard the
> secret entrance to the Volunteer Grid from passersby.
>
> The way that it failed to scale up was like this: suppose you use K=3,
> H=7, M=10 erasure-coding. Then the more nodes in the system the more
> likely it is to incur a simultaneous outage of 5 different nodes
> (H-K+1), which *might* render some files and directories unavailable.
> (That's because some files or directories might be on only H=7
> different nodes. The failure of 8 nodes (M-K+1) will *definitely*
> render some files or directories unavailable.) In the allmydata.com
> case, this could happen due to the simultaneous outage of any two of
> the 1U servers, or any one of the bigger servers, for example [*]. In
> a friendnet such as the Volunteer Grid, this would happen if we had
> enough servers that occasionally their normal level of unavailability
> would coincide on five or more of them at once.
>
> Okay, that doesn't sound too good, but it isn't that bad. You could
> say to yourself that at least the rate of unavailable or even
> destroyed files, expressed as a fraction of the total number of files
> that your grid is serving, should be low. *But* there is another
> design decision that mixes with this one to make things really bad.
> That is: a lot of maintenance operations like renewing leases and
> checking-and-repairing files, not to mention retrieving your files for
> download, work by traversing through directories stored in the
> Tahoe-LAFS filesystem. Each Tahoe-LAFS directory (which is stored in a
> Tahoe-LAFS file) is independently randomly assigned to servers.
>
> See the problem? If you scale up the size of your grid in terms of
> servers *and* the size of your filesystem in terms of how many
> directories you have to traverse through in order to find something
> you want then you will eventually reach a scale where all or most of
> the things that you want are unreachable all or most of the time. This
> is what happened to allmydata.com, when their load grew and their
> financial and operational capacity shrank so that they couldn't
> replace dead hard drives, add capacity, and run deep-check-and-repair
> and deep-add-lease.
>
> I want to emphasize the scalability aspect of this problem. Sure
> anyone who has tight finances and high load can have problems, but in
> this case the problem was worse the larger the scale.
>
> All of the data that its customers entrusted to it is still in
> existence on those 200 disks, but to access *almost any* of that data
> requires getting *almost all* of those disks spinning at the same
> time, which is beyond allmydata.com's financial and operational
> capacity right now.
>
> So, I have to face the fact that we designed a system that is
> fundamentally non-scalable in this way. Hard to swallow.
>
> On the bright side, writing this letter has shown me a solution! Set M
>>= the number of servers on your grid (while keeping K/M the same as
> it was before). So if you have 100 servers on your grid, set K=30,
> H=70, M=100 instead of K=3, H=7, M=10! Then there is no small set of
> servers which can fail and cause any file or directory to fail.
>
> With that tweak in place, Tahoe-LAFS scales up to about 256 separate
> storage server processes, each of which could have at least 2 TB (a
> single SATA drive) or an arbitrarily large filesystem if you give it
> something fancier like RAID, ZFS, SAN, etc.. That's pretty scalable!
>
> Please learn from our mistake: we were originally thinking of the
> beautiful combinatorial math that shows you that with K=3, M=10, with
> a "probability of success of one server" being 0.9, you get some
> wonderful "many 9's" fault-tolerance. This is what Nicholas Nassim
> Taleb in "The Black Swan" calls The Ludic Fallacy. The Ludic Fallacy
> is to think that what you really care about can be predicted with a
> nice simple mathematical model.
>
> Brian, with a little help from Shawn, developed some slightly more
> complex mathematical models, such as this one that comes with
> Tahoe-LAFS:
>
> http://pubgrid.tahoe-lafs.org/reliability/
>
> (Of course mathematical models are a great help for understanding. It
> isn't a fallacy to use them; it is a fallacy to think that they are a
> sufficient guide to action.)
>
> It would be a good exercise to understand how the allmydata.com
> experience fits into that "/reliability/" model or doesn't fit into
> it. I'm not sure, but I think that model must fail to account for the
> issues raised in this letter, because that model is "scale free"—there
> is no input to the model to specify how many hard drives or how many
> files (except indirectly in check_period which has to grow as the
> number of files grows), but my argument above persuades me that the
> architecture is not scale-free: if you add enough hard drives
> (exceeding your K,H,M parameters) and enough files then it is
> guaranteed to fail.
>
> Okay, I have more to say in response to Shawn's comments about grid
> management, but I think this letter is in danger of failing due to
> having scaled up during the course of this evening. ;-) So I will save
> some of my response for another letter.
>
> Regards,
>
> Zooko
>
> http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1199# document known scaling issues
>
> [*] Historical detail: allmydata.com was before Servers Of Happiness
> so they had it even worse—a file might have had all of its shares
> bunched up onto a single disk.
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at tahoe-lafs.org
> http://tahoe-lafs.org/cgi-bin/mailman/listinfo/tahoe-dev
>