[tahoe-dev] Estimating reliability

Brian Warner warner-tahoe at allmydata.com
Tue Jan 13 07:33:48 UTC 2009


> This makes our user-tunable parameters directly expressive of what
> the user cares about, rather than these arcane values which require
> understanding of the way the system works, rather than what the user
> wants it to do.

Excellent goal! I am prone to forgetting that.

> Yep, that's what the chain matrix will look like.  One thing that
> occurs to me is that with variable p_i, you can't really say that the
> probability of any file that starts with 8 shares has a probability x
> of ending with 6, because the probability depends on the p_i of the
> specific 8 peers holding the shares.  I can think of a couple of ways
> to address that.  Hmm.

Yeah, that was one of the reasons I went with the assumption of uniform
behavior. You might be able to handwave something about how, although server
A behaves differently than server B, we don't know in advance whether a
share will land on A or B, so you can effectively average them all together.

Incidentally, the very first peer-selection design we started with ("Tahoe1",
as opposed to the current "Tahoe2" algorithm, or the "Tahoe3" that we backed
away from about a year ago) assigned "reliability points" to each server
(based upon a hypothetical long-term measurement of each server), and
assigned shares in permuted order until enough points had been accumulated.
One of the reasons we gave up on it was that I didn't believe we could
actually measure server reliability in any meaningful+accurate way.

> >  We'll do repair whenever it is desired and possible (i.e. k<=i<R),
> > so we construct a second "post-repair" matrix Mpr, in which e.g.
> > Mpr(start=10, finish=10) = sum([Md(start=10,finish=i) for i in
> > range(k,R)+[N]]).
> 
> I'm not following your notation here.

What I meant was that the post-repair chain matrix is derived from the
previous "decay" chain matrix by identifying all of the states that would
provoke a file-repair operation, and moving those probabilities over into the
post-repair N-shares cell. If k=3, R=7, N=10, and we end the period with 7
shares, we'll start the next period with 7 shares. But if we end the period
with 6 shares, we'll do a repair, so we'll start the next period with 10
shares. If we end the period with 2 shares, we'll start the next period with
2 shares, because we've lost too many shares to do a repair.

Let's say the (i,j)'th cell of the Md matrix represents the conditional
probability that, given "i" shares at the start of the period, there will be
"j" shares at the end of the period. For each row with constant "i", the j=0
cell represents complete loss of all shares, the j=k-1 cell is the
best-but-still-not-good-enough case, the j=k cell is the
just-barely-good-enough case, and the j=i cell is the
haven't-lost-anything-this-period case. Clearly all cells of Md where
j>i will have probability zero: the grid won't spontaneously create new
shares.
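For concreteness, here's one way Md itself might be filled in under the
uniform-behavior assumption from earlier (this is my sketch, not settled
design): if each share independently survives a period with some probability
p_survive (a hypothetical per-period figure, given that we can't really
measure server reliability accurately), then each row is just a binomial
distribution:

```python
from math import comb

def make_Md(N, p_survive):
    # Md[i][j]: probability that, starting the period with i shares,
    # exactly j of them survive to the end of the period.  Assumes each
    # share survives independently with probability p_survive.
    Md = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        for j in range(i + 1):
            Md[i][j] = comb(i, j) * p_survive**j * (1 - p_survive)**(i - j)
    # cells with j > i stay 0.0: the grid won't spontaneously create shares
    return Md
```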

Now Mpr is derived from Md with the following algorithm:

 def make_Mpr(Md, k, R, N):
   # k, R, N passed in explicitly; Mpr allocated up front
   Mpr = [[0.0] * (N + 1) for _ in range(N + 1)]
   for i in range(N + 1):
     p_repaired = 0.0
     for j in range(0, k):
       Mpr[i][j] = Md[i][j]  # no repair possible: dead file
     for j in range(k, R):
       # this range triggers repair, so we'll have N shares, not j
       Mpr[i][j] = 0.0
       p_repaired += Md[i][j]
     for j in range(R, N):
       Mpr[i][j] = Md[i][j]  # no repair needed: healthy-enough file
     Mpr[i][N] = Md[i][N] + p_repaired
   return Mpr
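To put the chain matrix to use, one approach (again a sketch, and
`survival_after` is a hypothetical helper, not anything in Tahoe) is to
start the file at N shares, apply the post-repair matrix once per period,
and read off the probability that at least k shares remain:

```python
def survival_after(Mpr, N, k, periods):
    # Start with probability 1.0 of holding a full N shares.
    state = [0.0] * (N + 1)
    state[N] = 1.0
    # One period = one application of the post-repair chain matrix
    # (row vector times matrix).
    for _ in range(periods):
        state = [sum(state[i] * Mpr[i][j] for i in range(N + 1))
                 for j in range(N + 1)]
    # The file is still recoverable iff at least k shares survive.
    return sum(state[k:])
```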

> It's also not clear to me now how I thought I could do it with the
> threshold-based scheme. Apparently I needed a larger margin to write in. If
> R=N, it's easy.

Heh.

I may throw together a program to try this, although I'm not convinced that
the numerical accuracy of regular floats will cut it. A lot of these numbers
will be very close to 1.0. Hm, maybe I'll see what the 'decimal' module can
do for me..
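To illustrate the accuracy worry: a double-precision float simply cannot
represent a survival probability that differs from 1.0 by less than about
1e-16, whereas 'decimal' (with the precision cranked up) can:

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # plenty of digits for probabilities near 1.0

# With regular floats, a survival probability this close to 1.0
# collapses to exactly 1.0:
p_float = 1.0 - 1e-20
assert p_float == 1.0

# Decimal keeps the distinction:
p_dec = Decimal(1) - Decimal("1e-20")
assert p_dec < 1
```

Another common dodge is to track the loss probability (1 - p) directly,
which stays comfortably representable as a small float.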


cheers,
 -Brian
