[tahoe-dev] How Tahoe-LAFS fails to scale up and how to fix it (Re: Starvation amidst plenty)

Greg Troxel gdt at ir.bbn.com
Fri Sep 24 23:38:58 UTC 2010


"Zooko O'Whielacronx" <zooko at zooko.com> writes:

> On the bright side, writing this letter has shown me a solution! Set M
> = the number of servers on your grid (while keeping K/M the same as
> it was before). So if you have 100 servers on your grid, set K=30,
> H=70, M=100 instead of K=3, H=7, M=10! Then there is no small set of
> servers which can fail and cause any file or directory to fail.

There's a semi-related reliability issue, which is that a grid of N
servers which are each available most of the time should allow a user to
store, check and repair without a lot of churn.  So rather than setting
M to N, I'd want to (without precise justification) set it to 0.8N or
something, and use the not-yet-implemented facility to sprinkle more
shares during repair without invalidating the ones that are there.
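As a rough illustration of that rule of thumb, here is a minimal sketch (hypothetical helper, not part of Tahoe-LAFS) that picks K/H/M for a grid of N servers, keeping the usual 3/7/10 ratios but capping M at about 0.8*N:

    # Hypothetical helper: suggest encoding parameters for an N-server grid,
    # keeping the stock K/M = 3/10 and H/M = 7/10 ratios while leaving some
    # headroom so check/repair tolerates ordinary server churn.
    def suggest_parameters(n_servers, k_ratio=3/10.0, h_ratio=7/10.0, headroom=0.8):
        m = max(1, int(n_servers * headroom))   # shares.total
        k = max(1, int(round(m * k_ratio)))     # shares.needed
        h = max(k, int(round(m * h_ratio)))     # shares.happy
        return k, h, m

    print(suggest_parameters(100))  # -> (24, 56, 80) for a 100-server grid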

For those who didn't follow my ticket, the example is (a small simulation
sketch follows it):

  assume K/H/M = 3/7/10.

  10 shares seqN placed onto 15 servers

  check finds 9 shares, because only 13/15 are up.

  currently, this results in writing 10 shares of seqN+1 on those 13

  next check, a different 13 are up, repeat
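A rough simulation of that churn (the numbers and names are illustrative only, not Tahoe-LAFS code): 10 shares on 15 servers, 13 of which happen to be reachable on any given check, so nearly every check triggers a full rewrite at a new sequence number.

    import random

    # Illustrative constants matching the example above.
    N_SERVERS, M_SHARES, UP = 15, 10, 13

    servers = list(range(N_SERVERS))
    holders = set(random.sample(servers, M_SHARES))  # servers holding seqN shares
    seqnum, rewrites = 0, 0

    for check in range(5):
        up = set(random.sample(servers, UP))         # a different 13/15 each check
        visible = len(holders & up)
        if visible < M_SHARES:
            # current behaviour: regenerate *all* shares at seqN+1 on the
            # currently reachable servers, abandoning the old placement
            seqnum += 1
            rewrites += 1
            holders = set(random.sample(sorted(up), M_SHARES))
        print("check %d: %d/%d shares visible, seqnum=%d"
              % (check, visible, M_SHARES, seqnum))

    print("full rewrites in 5 checks:", rewrites)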


Instead, if check synthesized the missing share and placed it, then there
would be two copies of one share and still 10 reachable shares, and as
servers fade in and out the verify process can still succeed.  So for
availability a/b and M shares, we end up with roughly M*b/a placed copies
of the M shares in total, and I think that's ok.
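A back-of-the-envelope check of that figure, using the numbers from the example (illustrative only):

    # With availability a/b = 13/15 and M = 10 shares, synthesizing and
    # re-placing only the missing shares converges to roughly M*b/a placed
    # copies in total, i.e. a little over one copy per share on average.
    a, b, M = 13, 15, 10
    total_placed = M * b / float(a)
    print("expected placed copies: %.1f (vs. %d distinct shares)"
          % (total_placed, M))
    # -> expected placed copies: 11.5 (vs. 10 distinct shares)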