[tahoe-dev] Low-effort repairs
Shawn Willden
shawn-tahoe at willden.org
Fri Jan 16 06:42:20 UTC 2009
On Thursday 15 January 2009 03:43:41 pm Brian Warner wrote:
> > For example, the probability of losing a file with N=10, k=3, p=.95 and
> > four lost shares which have been replaced by duplicates of still-extant
> > shares is 9.9e-8, as compared to 1.6e-9 for a proper repair job. Not that
> > much worse.
>
> Neat! How did you compute that number?
First, the practice. Here's how to do it using my statistics module:
f1 = binomial_distribution_pmf(1, .95)
tmp = binomial_distribution_pmf(2, .95)
f2 = [ tmp[0], tmp[1]+tmp[2] ]
f = reduce(convolve, [f1]*2 + [f2]*4)
return sum(f[0:3])
Oh, while I'm at it, the computation for the replicate-all-six case is:
tmp = binomial_distribution_pmf(2, .95)
f2 = [ tmp[0], tmp[1]+tmp[2] ]
f = reduce(convolve, [f2]*6)
return sum(f[0:3])
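For completeness, here's a self-contained sketch of the whole computation. The statistics module itself isn't included in this message, so binomial_distribution_pmf and convolve below are minimal stand-ins assumed to match its interface:

```python
from functools import reduce
from math import comb

def binomial_distribution_pmf(n, p):
    # Stand-in: PMF of a Binomial(n, p) variable as a 0-based list
    # of length n+1, matching the statistics module's assumed interface.
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

def convolve(a, b):
    # Discrete convolution of two PMFs given as lists.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

p, k = 0.95, 3

# Two singleton shares plus four duplicated pairs; a pair survives
# iff at least one of its two copies does.
f1 = binomial_distribution_pmf(1, p)
tmp = binomial_distribution_pmf(2, p)
f2 = [tmp[0], tmp[1] + tmp[2]]

print(sum(reduce(convolve, [f1]*2 + [f2]*4)[:k]))  # ~9.9e-8: four duplicates
print(sum(reduce(convolve, [f2]*6)[:k]))           # ~5.8e-10: all six duplicated

f = binomial_distribution_pmf(10, p)
print(sum(f[:k]))                                  # ~1.6e-9: proper repair
```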
Now for the theory:
The random variable K -- the number of distinct surviving shares -- can be
written as sum(K_j for j in range(1, 7)), with one term per distinct share
(six of them here: two singletons and four duplicated pairs). Each K_j is a
random variable indicating whether that share survives -- either 0 or 1.
If K_1 and K_2 represent the two normal shares, then they each have the simple
probability mass function which I'll call f1:
Pr[K_1=i] = Pr[K_2=i] = f1(i) = [ 1-p, p ]
The PMF of the other four shares is more complex. Each pair is basically a
microcosm of the normal k-of-N problem, with N=2 and k=1. Per the binomial
distribution, the PMF of this case is
Pr[K_j=i] = tmp(i) = [ (1-p)^2, 2p-2p^2, p^2 ]
(if it's not obvious, Pr[K_j=i] is in the ith position of the 0-based list)
But we don't care whether K_j=1 or K_j=2, just that K_j>0. So, we sum those
cases to produce the PMF:
f2(i) = [ (1-p)^2, 2p-p^2 ]
Now we have PMFs for K_1 through K_6, so we just need the PMF of their sum K
in order to calculate Pr[K<k].
According to the discrete convolution theorem, given that the K_j are mutually
independent and K = K_1 + K_2 + ... + K_6, it's the case that
Pr[K=k] = f(k) = (f1*f1*f2*f2*f2*f2)(k)
where '*' is the convolution operator. Because convolution is associative, we
can rewrite that as:
f(k) = (((((f1*f1)*f2)*f2)*f2)*f2)(k)
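A quick numeric check of that associativity claim, using a minimal stand-in
for the statistics module's convolve:

```python
def convolve(a, b):
    # Discrete convolution: PMF of the sum of two independent variables.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

p = 0.95
f1 = [1 - p, p]                   # singleton share
f2 = [(1 - p)**2, 2*p - p**2]     # duplicated pair, collapsed to 0/1

# Left-to-right folding, as in (((((f1*f1)*f2)*f2)*f2)*f2)...
left = convolve(convolve(convolve(convolve(convolve(f1, f1), f2), f2), f2), f2)
# ...and an arbitrary regrouping; associativity says they must agree.
right = convolve(convolve(f1, f1),
                 convolve(convolve(f2, f2), convolve(f2, f2)))

assert all(abs(a - b) < 1e-15 for a, b in zip(left, right))
print(left[0])   # Pr[K=0] = (1-p)^10
```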
With Mathematica it's easy enough to do all of the convolutions symbolically,
but they get rather long and nasty. Here's the final f(k):
k=0   (1-p)^10
k=1   2 (1-p)^8 p (5 - 3p)
k=2   (1-p)^6 p^2 (41 - 50p + 15p^2)
k=3   4 (1-p)^4 p^3 (22 - 41p + 25p^2 - 5p^3)
k=4   (1-p)^2 (2-p)^2 p^4 (26 - 40p + 15p^2)
k=5   2 (1-p) (2-p)^3 p^5 (4 - 3p)
k=6   (2-p)^4 p^6
Those sum to 1, as a PMF must. If you add them and do the algebra you can
confirm it for yourself :)
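If you'd rather let the machine do the algebra, here's a numeric spot-check
(stdlib only) that the closed forms above match the convolution and sum to 1
at several values of p:

```python
from functools import reduce

def convolve(a, b):
    # Discrete convolution of two PMFs given as lists.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def closed_form(p):
    # The symbolic f(k) table, entered as Python expressions.
    return [
        (1 - p)**10,
        2 * (1 - p)**8 * p * (5 - 3*p),
        (1 - p)**6 * p**2 * (41 - 50*p + 15*p**2),
        4 * (1 - p)**4 * p**3 * (22 - 41*p + 25*p**2 - 5*p**3),
        (1 - p)**2 * (2 - p)**2 * p**4 * (26 - 40*p + 15*p**2),
        2 * (1 - p) * (2 - p)**3 * p**5 * (4 - 3*p),
        (2 - p)**4 * p**6,
    ]

for p in (0.5, 0.9, 0.95, 0.99):
    f1 = [1 - p, p]
    f2 = [(1 - p)**2, 2*p - p**2]
    f = reduce(convolve, [f1]*2 + [f2]*4)
    g = closed_form(p)
    assert abs(sum(g) - 1.0) < 1e-12                     # a PMF sums to 1
    assert all(abs(a - b) < 1e-12 for a, b in zip(f, g)) # matches convolution
```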
Finally, the probability of losing the file is sum(f(i) for i in range(0, k)).
This approach is really powerful. It even allows you to handle situations
like multiple data centers full of heterogeneous servers (distinct p_i), and
each data center with its own distinct failure probability that impacts all
servers in the center. Assuming you can get good estimates for every one of
those probabilities, you can calculate the PMF of the whole mess. Toss in a
bunch of individual servers as well if you like.
Of course, actually obtaining good estimates of all of those failure
probabilities is probably impossible, but to whatever degree you can get the
estimates, you can build the composite PMF and find the answers.
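To make that concrete, here's a sketch of such a composite PMF. Every
probability below, and the datacenter_pmf helper, are invented purely for
illustration; the point is just that mixtures and convolutions compose:

```python
from functools import reduce

def convolve(a, b):
    # Discrete convolution: PMF of the sum of two independent counts.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def datacenter_pmf(q, server_ps):
    # One share per server. With probability q the whole center goes
    # down (zero survivors); otherwise its servers fail independently.
    f = reduce(convolve, [[1 - p, p] for p in server_ps])
    return [q * (i == 0) + (1 - q) * v for i, v in enumerate(f)]

# All probabilities here are made up for illustration.
f = reduce(convolve, [
    datacenter_pmf(0.01, [0.95, 0.95, 0.95, 0.90]),  # center A, 4 shares
    datacenter_pmf(0.02, [0.90, 0.90, 0.85]),        # center B, 3 shares
    [0.20, 0.80], [0.30, 0.70], [0.10, 0.90],        # three lone servers
])
print(sum(f[:3]))   # Pr[K < 3]: probability the file is lost with k=3
```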
> > If there's storage to spare, the repairer could even direct all six peers
> > to duplicate their shares, achieving a file loss probability of 5.8e-10,
> > which is *better* than the nominal case, albeit at the expense of
> > consuming 12 shares of distributed storage rather than 10.
>
> Yeah, it's a big multivariable optimization/tradeoff problem: storage space
> consumed, CPU used, bandwidth used (on links of varying capacities, owned
> by different parties), reliability (against multiple sorts of failures).
> Very messy :).
Yep. I'm simplifying the issue of CPU vs. bandwidth by pushing it off on
someone else. I have a notional cost function which computes the putative
cost of a repair operation, taking into account whatever you want it to take
into account. The one I'm playing with ignores CPU and just counts
bandwidth, weighting upstream bandwidth at 8 times the cost of downstream
bandwidth, since that's the up/down ratio on my cable modem.
That at least reduces the problem to a three-way optimization issue --
storage, repair cost and reliability. I'm thinking I should be able to get
it to a point where you can fix (or bound) two of the three parameters and
optimize for the other.
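As a strawman, a bandwidth-only cost function might look like the following.
The names, byte counts, and the assumption that peer-directed duplication
costs the repairer nothing are all illustrative, not my actual code:

```python
# Hypothetical repair-cost function: ignore CPU, count bytes moved
# through the repairer, and weight upstream traffic 8x downstream
# (the asymmetry of a typical cable modem).
UP_WEIGHT, DOWN_WEIGHT = 8, 1

def repair_cost(bytes_up, bytes_down):
    return UP_WEIGHT * bytes_up + DOWN_WEIGHT * bytes_down

share = 1_000_000  # say, 1 MB per share

# Proper repair of 4 lost shares: download k=3 shares, upload 4 new ones.
proper = repair_cost(bytes_up=4 * share, bytes_down=3 * share)

# Low-effort repair: direct peers to duplicate shares they already hold.
# In this rough model no bytes pass through the repairer at all (the
# peers' own transfer costs are ignored here).
low_effort = repair_cost(bytes_up=0, bytes_down=0)

print(proper, low_effort)
```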
The attached patch contains the current state of my code and my paper
(docs/lossmodel.lyx).
Shawn.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lossmodel.dpatch.zip
Type: application/x-zip
Size: 68676 bytes
Desc: not available
URL: <http://tahoe-lafs.org/pipermail/tahoe-dev/attachments/20090115/32c4836d/attachment.bin>