[tahoe-dev] Low-effort repairs

Brian Warner warner-tahoe at allmydata.com
Thu Jan 15 22:43:41 UTC 2009


On Thu, 15 Jan 2009 00:33:09 -0700
Shawn Willden <shawn-tahoe at willden.org> wrote:

> However, it occurs to me that there may be situations in which a quick and
> dirty repair job may be adequate, and much cheaper. Rather than
> regenerating the shares and delivering the actual lost copies, the repairer
> can simply make additional copies of the shares still remaining.

Yeah, the system we had before Tahoe (now referred to as "Mountain View", and
closely related to the Mnet/HiveCache codebase) used both expansion and
replication. I think we were using an expansion factor of 4.0x, and a
replication factor of 3.0x, for a total share size that was 12x the original
data.

Your analysis is completely correct.

We didn't put any energy into replication in Tahoe. One reason was that it
makes failure analysis harder (*which* share was lost now matters, so one of
the independent-failures assumptions must be dropped). Another reason was
that we figured that, since allmydata.com's servers are all in the same colo,
bandwidth was effectively free. A third is that we simply haven't gotten
around to it.

We'd need a storage-server API to introduce server A to server B, and then
tell A to send a given share to B (this is pretty easy if one of them has a
public IP, and gets considerably harder when both are behind NAT). The repair
process would need to make a decision about when it was ok to replicate and
when it was necessary to encode new shares.. perhaps the first three lost
shares could be addressed by replication, after which new shares must be
generated.
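
Roughly, the policy might look like this (a hypothetical sketch, not an
actual Tahoe API; the threshold and function names are made up):

    # Hypothetical repair policy: cheap replication for a few missing
    # shares, full re-encoding once too many are gone.
    REPLICATION_THRESHOLD = 3   # "first three lost shares" go the cheap way

    def plan_repair(total_shares, present_shares):
        missing = total_shares - len(present_shares)
        if missing == 0:
            return ("noop", 0)
        if missing <= REPLICATION_THRESHOLD:
            # cheap path: tell servers that still hold shares to copy them
            # to newly-introduced servers
            return ("replicate", missing)
        # expensive path: fetch at least k shares, re-run the encoder, and
        # upload the regenerated shares to new servers
        return ("encode", missing)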

> For example, the probability of losing a file with N=10, k=3, p=.95 and
> four lost shares which have been replaced by duplicates of still-extant
> shares is 9.9e-8, as compared to 1.6e-9 for a proper repair job. Not that
> much worse.

Neat! How did you compute that number?
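
Here's one way that reproduces both figures, assuming independent server
failures and that the file is lost once fewer than k distinct shares
survive (just my guess at your model):

    # p_loss: probability that fewer than k distinct shares survive.
    # share_copies[i] is how many servers hold a copy of distinct share i;
    # each server independently survives with probability p.
    def p_loss(share_copies, k, p):
        survive = [1 - (1 - p) ** c for c in share_copies]
        dist = [1.0]   # dist[j] = P(exactly j distinct shares survive)
        for s in survive:
            new = [0.0] * (len(dist) + 1)
            for j, pr in enumerate(dist):
                new[j] += pr * (1 - s)   # this share lost entirely
                new[j + 1] += pr * s     # at least one copy survives
            dist = new
        return sum(dist[:k])

    # proper repair: 10 distinct shares, one copy each
    print(p_loss([1] * 10, k=3, p=0.95))           # ~1.6e-9
    # quick repair: 6 distinct shares left, 4 of them duplicated
    print(p_loss([2] * 4 + [1] * 2, k=3, p=0.95))  # ~9.9e-8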

> If there's storage to spare, the repairer could even direct all six peers
> to duplicate their shares, achieving a file loss probability of 5.8e-10,
> which is *better* than the nominal case, albeit at the expense of consuming
> 12 shares of distributed storage rather than 10.

Yeah, it's a big multivariable optimization/tradeoff problem: storage space
consumed, CPU used, bandwidth used (on links of varying capacities, owned by
different parties), reliability (against multiple sorts of failures). Very
messy :).
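
For what it's worth, the p_loss() sketch above also reproduces the 5.8e-10
figure under the same independence assumptions:

    # all six remaining shares duplicated, 12 servers in total
    print(p_loss([2] * 6, k=3, p=0.95))            # ~5.8e-10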


cheers,
 -Brian


