[tahoe-dev] Selecting optimal FEC parameters (also, idea for new peer selection policy)

Brian Warner warner at lothar.com
Wed Aug 19 21:24:32 UTC 2009

Good stuff!

Shawn Willden wrote:

> A local share is useless for ONE purpose of backups -- disaster
> recovery -- but it improves performance of retrievals for many other
> purposes of backups.

It would be neat if you could put your share-placement decisions into a
function and get back a couple of predictive numbers: the file-survival
probability, the predicted file-download speed (from various places),
etc, then compare the outcome of a couple of different choices.
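For the file-survival half, a toy sketch of such a predictive function (not Tahoe code; it assumes independent server failures and one share per server, and the helper name is made up):

```python
from itertools import combinations
from math import prod

def survival_probability(k, server_survival):
    # Probability that at least k of the servers (one share each)
    # survive, given independent per-server survival probabilities.
    # Hypothetical helper, purely for comparing placement choices.
    n = len(server_survival)
    total = 0.0
    for alive in range(k, n + 1):
        for subset in combinations(range(n), alive):
            total += prod(
                server_survival[i] if i in subset else 1 - server_survival[i]
                for i in range(n))
    return total

# e.g. 3-of-10 encoding, every server 90% likely to survive the period:
print(survival_probability(3, [0.9] * 10))
```

Feeding two candidate placements through a function like this (and a similar one for predicted download speed) would let you compare the outcomes directly.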

> 2.  Retrieval performance is maximized when shares are retrieved from
> as many servers at once as possible (assuming all are roughly equally
> responsive).

Only if your downstream pipe is faster than the sum of the servers'
upstream (but only if those servers aren't doing anything else with
their upstream bandwidth). American DSL/cablemodem lines seem to mostly
have an 8-to-1 ratio between down and up. So if you assume a homogeneous
mix of nodes, you don't get much benefit from k>8.
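The arithmetic, as a toy model (made-up DSL figures, ignoring protocol overhead and competing traffic):

```python
def effective_download_speed(k, server_up_kbps, client_down_kbps):
    # Rough ceiling on download speed when fetching k shares in
    # parallel: the client's downstream pipe, or the sum of the k
    # fastest servers' upstream rates, whichever is smaller.
    return min(client_down_kbps,
               sum(sorted(server_up_kbps, reverse=True)[:k]))

# 8-to-1 DSL line: 6000 kbps down, 750 kbps up per server.
print(effective_download_speed(8, [750] * 20, 6000))   # 6000: pipe is full
print(effective_download_speed(16, [750] * 20, 6000))  # still 6000: no gain
```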

Also, this discounts the overhead of talking to lots of servers
(especially in light of the serialized-peer-selection thread happening
elsewhere on tahoe-dev), and the system-wide cost of involving lots of
servers in your download operation. You might find that your downloads
are slower because all of those servers are also feeding data to other
downloaders, seeking back and forth between multiple shares, etc. At low
utilization this shouldn't be a big issue, but it should be considered
for more busy grids.

> This means that K should be set to the size of the grid, and M
> adjusted based on reliability requirements.

Incidentally, could you use "N" to refer to the number of shares
produced, rather than "M"? That would match the existing tahoe docs. (I
think Zooko researched the etymology and found justification for both
"N" and "M", but tahoe is currently using "N"). We still need another
letter to indicate the size of the grid.

> 3.  M larger than the grid means that each server will receive multiple 
> shares. A reasonable first assumption is that all shares on a given
> server will survive or fail together

Certainly the availability of a server (which depends upon it being
turned on, connected to a functioning network, running the tahoe
process, etc) is lower than the short-term reliability of the data it
serves (i.e. has the disk suffered an error yet). So yeah, I guess
that's a reasonable starting point.

> Setting M to K**2 (with K = number of servers) ensures that any one of
> the servers has all the information needed to restore any file, at the
> cost of an FEC expansion factor of K.

That's equivalent to using 1-of-(numservers) encoding (i.e. simple
replication), and changing the downloader to pull multiple segments in
parallel (i.e. the bittorrent approach).
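To see the arithmetic, here's a hypothetical round-robin placement (not Tahoe's actual algorithm) with K=4 servers and M=K**2=16 shares:

```python
def shares_per_server(m, num_servers):
    # Round-robin placement: share s goes to server s mod num_servers.
    counts = [0] * num_servers
    for s in range(m):
        counts[s % num_servers] += 1
    return counts

k = 4          # number of servers, and the FEC "k" in this scenario
m = k * k      # 16 shares total, expansion factor m/k = k
print(shares_per_server(m, k))  # [4, 4, 4, 4]: each server holds a full k shares
```

Each server ends up with k shares, i.e. enough to reconstruct the file on its own, which is exactly the simple-replication equivalence.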

I'm not sure how I feel about that. I suspect that an expansion factor
like that would be too much for a lot of people, but of course it
depends heavily upon both the incentives involved (are you buying
hardware for them, or asking to use some of their own?) and the
who-downloads-what goals (having a full copy of my pictures at my mom's
house means she can view them quickly).

A grid in which each node requires sum([usage(who) for who in allusers])
would be tough to sell to a general audience.. that's a lot of disk, and
it makes it seem like we don't have any of this fancy erasure-coding
goodness :).

> I need to do the math, but I suspect that if K / M (read K divides M),
> then reliability is not improved by increasing M by less than K.

I'd say that availability is not improved: you're effectively doing
1-of-(numservers) replication, so there's no benefit (from the client's
point of view) to storing 1.1, 2.0, or even a million copies of the same
data on a single server: it's only ever going to ask for one of them.

But the server can perform local repair if it has at least one extra
share. Holding K+1 shares in any given location (on one server, in one
datacenter, etc) means that the loss of a single share can be repaired
using only local bandwidth. We don't have any code to do this yet, but
it's feasible. So I'd say it does improve reliability.

As I've written elsewhere, I consider "availability" and "reliability"
to be two outputs of the same function:

 P(t,d) : probability that your file can be recovered, given that you
          don't start trying until time "t" and are willing to wait "d"
          seconds for results

where "availability" is basically where d=0, and "reliability" is where
d is large.

But of course, then you have to consider the different probabilities of
that server's drive (with its K+1 shares) suffering a single bit-flip
versus dying completely. Once upon a time, drives tended to die slowly,
with unrecoverable sectors gradually increasing in number. I don't know
how they behave these days. It may well be the case that there's no
point to putting K+1 on a single disk. But K+1 in a single datacenter
(or on single computer with K+1 disks) may be useful.

> 5.  Not all servers provide the same amount of storage space, so some
> may get full and begin refusing to accept shares. I think the simplest
> way to address this is to simply exclude that server from the
> selection process and recompute K and M.

Incidentally, this is the sort of information which could be
probabilistically amortized by announcing it through the Introducer. If
you're doing a lot of uploads, you might be able to save considerable
RTT time by not querying the servers which are known to be full (and
querying others instead). You might still catch a server which filled up
since the last Introducer announcement, but it'd be rare.
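A sketch of that filtering step (the announcement field names here are invented, not Tahoe's actual Introducer schema):

```python
def upload_candidates(announcements, needed):
    # Pick servers to query, skipping those whose last Introducer
    # announcement said they were full. A server that filled up since
    # its last announcement is still caught by the usual per-server
    # query, just at the cost of one extra RTT.
    open_servers = [a["serverid"] for a in announcements
                    if not a.get("full", False)]
    return open_servers[:needed]

anns = [
    {"serverid": "s1", "full": False},
    {"serverid": "s2", "full": True},
    {"serverid": "s3", "full": False},
]
print(upload_candidates(anns, 2))  # ['s1', 's3']
```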

If you require each server to have a full copy, then yeah, there's no
benefit to be had from servers that can only hold less than that. But
there is plenty of benefit to be had in the spaces in between: maybe my
mom's computer can only hold K/2 shares, but she still sees the pictures
faster than if she didn't have any data at all.

> If so, I'd like the upload to FAIL.

I've been reluctant to agree with this sort of behavior, but I think I
need to get over that. (I wonder if there's a sort of Myers-Briggs
personality test that instead looks at your feelings about Brewer's CAP
theorem: apparently I've been landing on the "prefers availability over
consistency" side all this time. You could probably interpret the CAP
theorem as a sort of consistency-vs-availability-vs-reliability thing,
and apparently I've been on the availability-over-reliability side too.
But I digress).

Maybe if the upload() API included some sort of desired-reliability
parameter, and if the Uploader concluded that it couldn't be achieved,
the upload throws a ReliabilityGoalUnreachable exception (which, given
enough UI work, might give the user the ability to try again with
less-lofty goals). Maybe my reluctance has been based upon the mistaken
notion that this "upload failed" response would be indistinguishable
from a less-workaroundable failure, like having zero servers available,
or a coding problem.
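The control flow might look something like this (names like predict_reliability and do_upload are placeholders for the real peer-selection and upload machinery, not a real Tahoe API):

```python
class ReliabilityGoalUnreachable(Exception):
    """Raised when the uploader predicts it cannot meet the requested
    reliability with the currently reachable servers."""

def upload(data, required_reliability, predict_reliability, do_upload):
    # Sketch of the proposed control flow only: estimate what the
    # current grid can deliver, and refuse to upload if it falls short.
    achievable = predict_reliability()
    if achievable < required_reliability:
        raise ReliabilityGoalUnreachable(
            "can achieve %.6f, need %.6f" % (achievable, required_reliability))
    return do_upload(data)

try:
    upload(b"...", 0.99999, lambda: 0.995, lambda d: "filecap")
except ReliabilityGoalUnreachable as e:
    print("upload refused:", e)  # UI could offer a retry with a lower goal
```

The important property is that the exception is distinct from "zero servers available" or a coding bug, so the UI can offer a meaningful workaround.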

> What this is shaping up to be is, perhaps, a peer selection policy
> that could be implemented in Tahoe which has as its input parameter a
> required reliability. The per-server reliability estimates could be
> supplied in a variety of ways. Ideally, long-term, I'd like Tahoe to
> estimate those reliabilities based on server availability history.

Yeah, this seems reasonable. Remember, however, that the download side
is still an unsolved problem. For small grids where you can afford to
query everybody, it's not a big deal, but we need to make Tahoe work for
large ones too. In the long run, maybe the filecap scheme will contain a
field to point to a server-selection policy, and maybe a second field
which passes extra information from the uploader to the downloader so
they can know where to look for the shares (which is currently derived
from the storage-index). Or we implement the #599 "servers know about
other shares" idea and then the downloader only needs to find one share.
Or we add a layer of indirection (at the expense of reliability) and
make the filecap contain a pointer to the serverlist. Or we use a more
static server-selection approach and have all files uploaded by a given
client get put on the same servers.

> Ohhh... that's a good point. Allowing M and/or K to be changed by a
> repairer will require a new cap structure. That would be a good point
> in time to add a hamming distance parameter to the cap as well,
> assuming we decide that is sufficiently helpful to justify the extra
> length.

Yeah. #678 isn't going to be easy.. changing N requires a different
sized merkle tree over the share hashes, with a different root, which
invalidates the filecap's UEB hash. If the client is using filecap(N=10)
and finds shares generated for N=20, it may be reasonable for it to
compare the share hashes against the 10 that it knows about, but not the
other way round (it couldn't actually take advantage of shares #10-19,
because it can't validate them).
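A toy demonstration of why the roots differ (SHA-256 over made-up leaf contents, not Tahoe's actual share-hash format):

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    # Pad the leaf list to a power of two with empty-hash leaves,
    # then hash pairwise up to the root.
    nodes = list(leaves)
    while len(nodes) & (len(nodes) - 1):
        nodes.append(h(b""))
    while len(nodes) > 1:
        nodes = [h(nodes[i] + nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

shares10 = [h(b"share%d" % i) for i in range(10)]
shares20 = shares10 + [h(b"share%d" % i) for i in range(10, 20)]
assert merkle_root(shares10) != merkle_root(shares20)
```

The first 10 leaves are identical in both trees, but the N=20 tree has a different shape and root, so a filecap committing to the N=10 root can't validate the larger tree.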

I'll respond to the hamming-distance idea elsewhere.

