[tahoe-dev] How Tahoe-LAFS fails to scale up and how to fix it (Re: Starvation amidst plenty)

Brian Warner warner at lothar.com
Fri Mar 25 19:40:07 UTC 2011


Having read Zooko's original message more carefully, I think I was
responding to the wrong concern (which is probably why I deferred
sending that response for long enough to forget about it). Here's a
better response.

On 9/24/10 12:36 AM, Zooko O'Whielacronx wrote:

> However, I'm also sure that Tahoe-LAFS *failed* to scale up in a
> different way, and that other failure is why I jealously guard the
> secret entrance to the Volunteer Grid from passersby.
> 
> The way that it failed to scale up was like this: suppose you use K=3,
> H=7, M=10 erasure-coding. Then the more nodes in the system the more
> likely it is to incur a simultaneous outage of 5 different nodes
> (H-K+1), which *might* render some files and directories unavailable.
> (That's because some files or directories might be on only H=7
> different nodes. The failure of 8 nodes (M-K+1) will *definitely*
> render some files or directories unavailable.) ...
> 
> Okay, that doesn't sound too good, but it isn't that bad. You could
> say to yourself that at least the rate of unavailable or even
> destroyed files, expressed as a fraction of the total number of files
> that your grid is serving, should be low. *But* there is another
> design decision that mixes with this one to make things really bad.
> That is: a lot of maintenance operations like renewing leases and
> checking-and-repairing files, not to mention retrieving your files for
> download, work by traversing through directories stored in the
> Tahoe-LAFS filesystem. Each Tahoe-LAFS directory (which is stored in a
> Tahoe-LAFS file) is independently randomly assigned to servers.
> 
> See the problem? If you scale up the size of your grid in terms of
> servers *and* the size of your filesystem in terms of how many
> directories you have to traverse through in order to find something
> you want then you will eventually reach a scale where all or most of
> the things that you want are unreachable all or most of the time

Ah, ok, so the concern is about the interaction between two trends:

 1: while the percentage of unavailable file objects remains constant,
    the absolute number of them is growing along with the rest of the
    grid.

 2: the "diameter" (in the graph-theory sense) or typical directory tree
    depth is growing over time, as an individual user stores more and
    more files. Therefore the number of successful downloads needed to
    read a leaf node is growing (you must be able to retrieve the root
    directory object, plus each intermediate subdirectory, down to the
    actual file you care about). Therefore the probability of
    successfully reading an average leaf node is dropping over time.

I think it's sufficient to simply pay attention to the second trend,
ignoring the first, and express it like this:

 3: given a constant probability of file-object unavailability, the
    availability of a given file is a function of its depth in the
    directory tree, and that depth is likely to grow over time
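
To make that concrete, here's a minimal sketch (plain Python, with
made-up availability numbers; nothing here is Tahoe code) of how the
chance of reading a file decays with the number of objects you have to
fetch to reach it:

    # Availability of a file as a function of how many objects must be
    # fetched to reach it (the root dirnode, each intermediate dirnode,
    # and the file itself), assuming each object retrieval succeeds
    # independently with probability p_obj.
    def p_file(p_obj, path_length):
        return p_obj ** path_length

    for p_obj in (0.999, 0.99):
        print([round(p_file(p_obj, d), 3) for d in (1, 5, 10, 20)])

With p_obj=0.99, a file ten objects deep is only about 90% retrievable,
even though each individual object looks fine on its own.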

Note that this affects a single user who stores more and more files over
time (and only retains a single out-of-band rootcap). The act of adding
more users to the system doesn't cause this problem (because they're
each holding their own rootcap). You might portray the concern as
scaling with the ratio of (number of files we care about) / (number of
out-of-band rootcaps we retain). Or better yet, (average depth of
directory tree) / (number of out-of-band rootcaps we retain).

Shawn's backup system, which flattens the directory tree into a single
table, drops the average-depth numerator to a nice constant "1". A
tahoe-like system that forsakes directories completely by retaining an
out-of-band table of filecaps would drop the numerator to 0.

I'd guess that, for most users, the average depth of their filesystems
is likely to grow logarithmically with the number of files, not
linearly, slowing the advancement of the problem somewhat. In fact, I
suspect that the average depth is closer to constant: I tend to use the
same directory structure on each new machine I build, and very rarely
introduce a new level of subdirectories (when something gets too big),
maybe once every other year.

So to model that, I'd pick a maximum target directory depth (say 10),
calculate Prob(success-of-file-retrieval) as requiring 10 simultaneous
successful object retrievals (i.e. Pfile = Pobj**10), then choose
k-of-N to get Prob(success-of-file-retrieval) above my goal.
It means you need more margin, certainly, but not anything too
difficult. And the rule would be that you get enough reliability as long
as you don't go crazy with your directory structures.
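
As a sanity check on that rule of thumb, here's a hedged sketch of the
missing Pobj piece (the 95% per-server uptime figure and the 3-of-10 /
3-of-20 parameters are made up, and shares are assumed to live on
distinct, independently-failing servers):

    from math import comb

    def p_obj(k, n, p_server):
        # an object is retrievable iff at least k of its n shares sit
        # on servers that are currently up
        return sum(comb(n, i) * p_server**i * (1 - p_server)**(n - i)
                   for i in range(k, n + 1))

    # plug into the depth rule: Pfile = Pobj ** 10 for a ten-object path
    for k, n in ((3, 10), (3, 20)):
        print(k, n, p_obj(k, n, 0.95) ** 10)

Even with those fairly pessimistic servers, a small k with a decent
share margin keeps a ten-object path comfortably close to 1.0.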

(there are other problems with very deep directory trees on local
filesystems too: OS limits on pathnames, usability limits of tools and
cut-and-paste of pathnames, reliability losses as you touch more and
more local disk blocks for the dnodes, etc, so it's not an entirely
foreign problem)


> This is what happened to allmydata.com, when their load grew and their
> financial and operational capacity shrank so that they couldn't
> replace dead hard drives, add capacity, and run deep-check-and-repair
> and deep-add-lease.

That's the concern my previous message responded to, because I think it
wasn't entirely accurate. It implies that dead hard drives were a
problem (when in fact we never lost enough to threaten anything), that
adding capacity would have helped, and that repair would have helped.
The actual problem was that the money ran out and the whole grid was
shut down: the revenue (i.e. customers being able to get their files)
and the cost (running those servers) were both all-or-nothing.

It *would* have been marginally helpful (but probably not practical or
feasible) to design a system that could save costs by scaling *down*
reliability while still retaining availability. Having all 100+ drives
from day one (or implementing continuous rebalancing to make it look
like you'd had them all along), and setting N much higher from day one
(so data was spread over all drives, not just a relatively-small subset)
would accomplish that. If the money had run out more gradually, we could
have powered down half the grid, cut the operational costs in half, and
continued to provide full service despite reduced redundancy (thus still
bringing in full revenue).
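
To put a rough number on that (a back-of-the-envelope sketch, not a
model of the actual allmydata grid; the 3-of-20-across-100-servers
figures are invented): if every object's shares had been spread over
the whole grid, powering off half of it at random would still leave
essentially every object retrievable:

    from math import comb

    def p_object_survives(k, n, total_servers, powered_off):
        # hypergeometric sketch: n shares sit on n distinct servers out
        # of total_servers, and powered_off servers are shut down at
        # random; the object survives if at least k shares remain
        ways = comb(total_servers, powered_off)
        dead = sum(comb(n, lost) *
                   comb(total_servers - n, powered_off - lost)
                   for lost in range(n - k + 1,
                                     min(n, powered_off) + 1))
        return 1 - dead / ways

    # e.g. 3-of-20 shares spread across a 100-server grid, 50 servers off
    print(p_object_survives(3, 20, 100, 50))

Losing 18 of 20 shares when only half the servers are down is
astronomically unlikely, which is the sense in which a wide spread
would buy graceful scale-down.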

From the business point of view, I don't think that would have worked:
either we'd have needed to spend a lot more money up front (to buy two
years' worth of hard drives and servers ahead of time), or spend the
engineering time and bandwidth and CPU time to do continuous
rebalancing, and the only benefit to the company would have been to be
able to cut our burn rate by a little bit without needing to shut off
all the revenue-generating accounts at the same time. But I think the
servers were not the majority of the business costs (compared to
salaries and office space, etc), and colo space is pretty quantized, so
I'm doubtful that the ability to scale downwards would have helped much.
(and what dot-com-era investor wants to hear that you're going to spend
even *more* of their money to create a company that can fail gracefully?
work hard, grow fast, reach for the stars or die trying, by golly :).

From the customer's point of view, of course, graceful scale-down is
better than all-or-nothing. But again, from a marketing point of view,
how much confidence would your average consumer have in a backup company
which was visibly planning for failure from the beginning?


Anyways, yeah, the does-it-scale-down question is a good one to ask, and
evaluating its benefits against its costs (rebalancing bandwidth,
basically) is a good comparison to make. I think the take-home message
may be for us to put more effort into a rebalancing and/or reencoding
strategy. I remember talking somewhere recently about experimenting with
fixing the encoding at like 3-of-256, always generating all 256 shares
regardless of how many servers were available, but only uploading a
reasonable subset of them (one per server up to some maximum). Then,
later, when more servers become available, you can spread the file out
further.
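
Here's a toy sketch of that placement idea (none of this is existing
Tahoe code; the function names and the one-share-per-server rule are
just assumptions for illustration):

    # Fix the encoding at 3-of-256, place one share per currently-known
    # server, and remember which share numbers are still unplaced so a
    # later pass can push them onto new servers.
    N = 256   # total shares; only 3 are needed to reconstruct

    def place_shares(servers):
        """Initial upload: map share_number -> server, one share each."""
        return dict(zip(range(N), servers))

    def spread_further(placement, new_servers):
        """Later, when more servers appear, place more of the N shares."""
        unplaced = (sh for sh in range(N) if sh not in placement)
        placement.update(zip(unplaced, new_servers))
        return placement

    shares = place_shares(["server%d" % i for i in range(10)])
    shares = spread_further(shares, ["server%d" % i for i in range(10, 40)])
    print(len(shares), "of", N, "shares placed")   # 40 of 256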

Another thing to consider is a form of immutable file which has mutable
encoding parameters. There would be a special "uploader" or "rebalancer"
cap, which holds a signing key, and the root of the share merkle tree
is signed (rather than hashed into the UEB). The plaintext/ciphertext
merkle tree *would* still be hashed, so the actual contents of the file
are immutable and
fixed by the readcap. But someone with the rebalance-cap could generate
new shares with different encoding parameters, so it could start with
3-of-10 when there are only 10 servers, but get changed to 30-of-100
when there are 100 servers and retain the filecap. Of course, the
question of who holds on to the rebalance-cap is a troublesome one:
anyone who holds it can create invalid shares that are inefficient to
detect and recover from, so you don't want to spread it around, but
it's a hassle to store. (hm, what if we stored rebalance-caps in
dirnodes like
mutable writecaps? read-only users wouldn't be able to help re-encode
the file, but the original uploader might. It might interact weirdly
with the immutable directories created by 'tahoe backup', but we could
probably make it work).
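
To make the split concrete, here's a toy sketch of which pieces such a
cap design would pin permanently and which pieces the rebalance-cap
holder could re-issue (the field names are invented, the real cap and
UEB formats look nothing like this, and the idea that the readcap would
pin the rebalance key's public half is my own assumption):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ReadCap:
        decryption_key: bytes
        ciphertext_merkle_root: bytes   # hashed: file contents immutable
        rebalance_verify_key: bytes     # public half of the signing key

    @dataclass
    class ShareSet:
        k: int                  # e.g. 3 today, 30 after re-encoding
        n: int                  # e.g. 10 today, 100 after re-encoding
        share_merkle_root: bytes
        signature: bytes        # made with the rebalance-cap's signing
                                # key over (k, n, share_merkle_root)

    @dataclass(frozen=True)
    class RebalanceCap:
        signing_key: bytes      # holder can publish a new ShareSet with
                                # different encoding parameters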


cheers,
 -Brian


