[tahoe-dev] Estimating reliability

Shawn Willden shawn-tahoe at willden.org
Fri Jan 9 21:07:28 UTC 2009


On Thursday 08 January 2009 07:56:17 pm Brian Warner wrote:
> The basic operation is "UTBF", upper-tail-binomial-distribution, sum([P("i
> shares remain") for i in range(k,N+1)]) 

That makes perfect sense to me, assuming servers are independent -- which is a 
reasonable assumption for the case I care about:  a backup network of 
geographically-distributed home computers.

> I came up with some approximations, because dealing with numbers that are
> so close to 1.0 wasn't working too well numerically.

That's easy to fix.  Instead of working with numbers that are close to 1, 
which causes stuff to fall off the end of the mantissa, restate the problem 
so the numbers are close to zero and you can use the full precision of the 
mantissa.

So what you want to look at is the lower tail of the PMF, not the upper.  That 
is, you calculate the probability that not enough servers survive and 
subtract from 1 to find the probability of survival.  Specifically, if K is 
the random variable representing the number of servers that survive, calculate:

	1 - sum(Pr[K=i] for 0 <= i < k)
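
For concreteness, here's that lower-tail sum as a quick Python sketch (the 
function name is hypothetical; it assumes server failures are independent, 
each server surviving the period with probability p):

    from math import comb  # Python 3.8+

    def p_file_loss(k, N, p):
        # Probability that fewer than k of the N servers survive,
        # i.e. the lower tail of the binomial PMF.  Every term is
        # small when p is near 1, so nothing falls off the end of
        # the mantissa.
        return sum(comb(N, i) * p**i * (1 - p)**(N - i)
                   for i in range(k))

Each term is well away from 1.0, so doubles keep their full precision; the 
only lossy step is the final subtraction from 1, and you can defer or skip 
that by reporting P[loss] itself.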

If you want to be sure you're not losing precision, though, there's another 
way:  Use an arbitrary-precision math library like mpmath.  I did some quick 
testing with Mathematica, using both machine and infinite precision, and I 
found that the error from using machine precision (double) was clear out in 
the 14th decimal place.  Not an issue.  BTW, here are numbers for k=3, N=10 
and a few values of p (where p is the probability of server survival in period A):

For p = .9, P[file loss in A] = 3E-7
For p = .99, P[file loss in A] = 4E-15
For p = .999, P[file loss in A] = 4E-23
For p = .999999, P[file loss in A] = 5E-47
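
If you want to double-check the machine-precision result yourself, the same 
sum in mpmath is only a few lines (a sketch; mp.dps, mpf, and binomial are 
mpmath's names, the rest is hypothetical):

    from mpmath import mp, mpf, binomial

    mp.dps = 50  # work with 50 significant decimal digits

    def p_file_loss_mp(k, N, p):
        # Same lower-tail binomial sum, in arbitrary precision.
        p = mpf(p)  # pass a string, e.g. '0.99', for an exact input
        return sum(binomial(N, i) * p**i * (1 - p)**(N - i)
                   for i in range(k))

    print(p_file_loss_mp(3, 10, '0.99'))  # ~4.4e-15, matching the table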

Extending to m periods is straightforward:

	(1 - sum(Pr[K=i] for 0 <= i < k))^m

Extending to more than one file is not so straightforward, unless there are 
only N peers in the network, in which case all files live or die together, or 
unless there are a very large number of peers in the network, such that every 
file has a disjoint set of servers, in which case, if n is the number of 
files, it's:

	(1 - sum(Pr[K=i] for 0 <= i < k))^(nm)
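
Both extensions amount to exponentiating the per-file, per-period survival 
probability.  As a sketch (reusing the hypothetical p_file_loss from above):

    def p_survival(k, N, p, m, n=1):
        # Probability that all n files survive m periods, assuming
        # independent periods and a disjoint server set per file.
        return (1 - p_file_loss(k, N, p)) ** (n * m)

Note that this re-introduces the numbers-near-1 problem; when P[loss] is 
tiny, the union bound n*m*P[loss] on the overall loss probability avoids the 
subtraction entirely and is accurate to first order.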

> I used decibels, from 
> the EE world, with "10 dBA" (decibels of Availability) meaning a 10% chance
> of failure / 90% chance of success, "20 dBA" meaning a 1% failure / 99%
> success, 30 dBA = 0.1% failure / 99.9% success, etc.

Hmm.  The logarithm function is monotonic, but if you start summing logs, 
you're doing something very different.  I'd have to think about it, but I 
don't think this gives you what you think it gives you.

In any case, I think it shouldn't be an issue to calculate the cumulative 
LTBF (the lower-tail counterpart of your UTBF) up to k-1 with floating-point 
numbers, and that gives you the probability of file loss.

Based on this, I think I have everything I need to calculate a reasonable 
estimate of reliability/availability (the distinction isn't relevant for small 
networks of people who can call each other to put servers back online) for 
backup networks of geographically-distributed home computers.

Thanks!

	Shawn.


