[tahoe-dev] Kragen Sitaker, Brian Warner, Josh Wilcox, Bram Cohen, Russ Roberts, and Nassim Taleb on erasure coding probabilities

zooko zooko at zooko.com
Wed Sep 26 17:56:51 UTC 2007


Folks:

So I chatted with Kragen Sitaker on IRC last week, and asked him if  
he knew of a way to get error bounds on

SUM for i=K..N ( choose(N, i) * P^i * (1-P)^(N-i) )

This is the formula for the reliability of a file, if the file is  
encoded with K-out-of-N erasure coding, and each share has a  
reliability of P.

Brian Warner has already calculated some approximations of this  
formula for some K,N,P that we care about (see below), but I was  
uncomfortable with those approximations because I didn't know their  
error bounds.  (Note that Josh Wilcox has already posted [1] a  
recursive relation which might prove useful for computing this value  
precisely using rationals, but I was curious if Kragen knew of  
another technique.)
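
(For what it's worth, the per-file number itself can be computed  
exactly if P is given as a rational, which sidesteps the error-bound  
question for that one step.  Here is a minimal sketch in Python --  
the direct binomial sum, not Josh's recursion, and the function names  
are mine, not anything in Tahoe:

    from fractions import Fraction

    def binomial(n, i):
        # n-choose-i, computed exactly with integer arithmetic
        result = 1
        for j in range(i):
            result = result * (n - j) // (j + 1)
        return result

    def file_reliability(k, n, p):
        # Probability that at least k of the n shares survive, assuming
        # each share independently survives with probability p (a Fraction).
        return sum(binomial(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    p = Fraction(99, 100)              # per-share reliability P = 0.99, exactly
    r = file_reliability(3, 10, p)     # K = 3, N = 10
    print(r)            # an exact rational
    print(float(r))     # roughly 1 - 4.4e-15 for these parameters

For these parameters the per-file availability has about fourteen  
nines, which is the sort of number that comes up again below.)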

Kragen said something along the lines of: "This is assuming that the  
reliabilities of the N servers are independent, right?  I'm not sure  
that this is a fair assumption."

I replied that this is a good point, and I mentioned how I rather  
agreed with Bram Cohen's quip that "Engineering for more than six  
nines is nothing but intellectual masturbation" [2].  (Although  
personally I'm more comfortable with five nines.  I can engineer for  
one-in-a-hundred-thousand events more confidently than for one-in-a- 
million.)

Please see Bram's post [2] and Kragen's follow-up [3].  Bram's reply  
to Kragen is incomplete: Bram doesn't explain why servers with less  
than 0.95 reliability can't be confidently analyzed, or why they are  
more likely to have correlated failure than servers with 0.99  
reliability, and those two assertions are not intuitively obvious to me.

Kragen said to me on IRC last week that Nassim Taleb has recently  
written a good book arguing that this kind of modelling is wrong.  I  
haven't read that book yet, and I didn't understand Kragen's point  
about this.  Maybe Kragen wants to follow up to this and explain?  
Also maybe Bram could follow up and explain his intriguing  
assertions from his livejournal.

Here is a podcast in which economist Russ Roberts interviews Nassim  
Taleb about his notion of "black swans" and "mediocristan vs.  
extremistan".  It's very interesting, and parts of it are quite  
relevant to the current issue of how to model reliability: [4].

Currently I believe that we can reasonably use the above formula to  
compute things with up to five nines.  We can answer questions such as:

  * If K=3 and N=10, then what is the server reliability P above  
which the file reliability will be >= 0.99999?
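
For example, a rough way to answer that question with the formula is  
to bisect on P.  This is just a sketch -- min_share_reliability is a  
made-up name and the tolerance is arbitrary:

    def reliability(k, n, p):
        # P(at least k of n shares survive), shares independent, survival prob p
        def binomial(n, i):
            result = 1
            for j in range(i):
                result = result * (n - j) // (j + 1)
            return result
        return sum(binomial(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    def min_share_reliability(k, n, target, tol=1e-9):
        # Bisect on p: smallest p in [0, 1] with reliability(k, n, p) >= target.
        # Valid because reliability is monotonically increasing in p.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if reliability(k, n, mid) >= target:
                hi = mid
            else:
                lo = mid
        return hi

    print(min_share_reliability(3, 10, 0.99999))   # roughly 0.85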

I do not think we can correctly use this formula to answer some other  
questions that we really want to answer, and that we are currently  
using this formula to answer.  See the otherwise excellent  
provisioning tool that comes with Tahoe, thanks to Brian Warner (it's  
on the welcome page): press the button labelled "Compute" and look at  
the output labelled "availability of ... all files in grid".  That  
value is an attempt to answer the question:

  * If K=3 and N=10, and P=0.99, and there are 50,000,000 files, then  
what is the probability that every one of the 50,000,000 files will  
be available?
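
Under the independence assumptions, the all-files number is just the  
per-file availability raised to the power of the number of files --  
something like the following sketch (I haven't checked this against  
the provisioning tool's actual code, and the numbers are illustrative):

    # Assumes files fail independently and that p_file is exactly right --
    # both of which are questionable, as discussed below.
    num_files = 50 * 10**6
    p_file = 1 - 4.4e-15          # e.g. K=3, N=10, P=0.99 from the formula above
    p_all_files = p_file ** num_files
    print(p_all_files)            # roughly 0.9999998 under these assumptions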

There are three reasons why the above formulation does not apply to  
the latter question.  First, the reliability of servers is not  
independent.  Second, the reliability of files is not independent.   
Third, the accuracy of the solution to the per-file reliability is  
unknown, and it is possible that the error in that number gives rise  
to critical error in the computation of the for-all-files reliability.

So I currently believe that the answers that we have produced to the  
second question (in the provisioning tool) are not justified and  
should not be relied upon or offered as fact to customers/users/ 
investors/etc.

Of course, using the formula to answer the first question also runs  
afoul of the fact that server reliability is not independent.  We can  
resolve this contradiction by either:

1.  Asserting that the chance of correlated failures such as arson at  
a co-lo or a critical bug in the server software is so much less than  
1 in 100,000 that it doesn't invalidate our answer to the first  
question, or

2.  Explicitly setting out those correlated failures in our answer,  
e.g.: "Assuming that the only kind of failure is a server failure --  
e.g. an IDE hard drive crash -- which is uncorrelated with the  
failures of other servers, then the answer is BLAH."  This means that  
the answer BLAH is not valid in the event of fire, flood, war,  
terrorism, act of god, penetration by a hacker, or mistake on the  
part of the Tahoe programmers.  There is a good reason that insurance  
contracts come with that sort of language.
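
To make option 2 concrete: one crude way to fold a correlated failure  
mode into the model -- purely my own illustration, with made-up  
numbers -- is to treat it as a single event that takes out every  
share at once, independent of the ordinary per-server failures.  The  
correlated term then puts a ceiling on the achievable availability:

    # Crude illustrative model, not Tahoe code: a single correlated failure
    # mode (co-lo fire, critical software bug, ...) takes down all shares at
    # once with probability p_correlated; otherwise servers fail independently.
    p_independent = 0.99999999    # per-file availability from the K-of-N formula
    p_correlated = 1e-4           # assumed chance of a grid-wide correlated failure

    p_file = (1 - p_correlated) * p_independent
    print(p_file)                 # about 0.9999 -- the correlated term dominates

With an assumed 1-in-10,000 chance of such an event, the result is  
stuck at roughly four nines no matter how good the independent- 
failure term is -- which is why option 1 requires that chance to be  
well below 1 in 100,000.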

Explicitly setting out those causes is a legitimate approach to both  
of the questions listed above.  Asserting that the cumulative  
probability of such causes is insignificant is not a legitimate  
approach to the second question, because computing an answer to the  
second question the way we currently do requires intermediate values  
of astronomical precision, such as the "availability of ... arbitrary  
file" field in the provisioning tool, which has fourteen nines!
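
To see why, note that when the per-file availability is raised to the  
50,000,000th power, an error way out in the fourteenth decimal place  
changes the final answer materially.  A quick sketch of that  
sensitivity (the eps values are illustrative, not measurements):

    num_files = 50 * 10**6

    # Per-file *un*availability eps, known only to within some error:
    for eps in (1e-14, 1e-13, 1e-12):
        p_all = (1 - eps) ** num_files    # roughly exp(-num_files * eps)
        print(eps, p_all)

    # eps = 1e-14  ->  about 0.9999995
    # eps = 1e-13  ->  about 0.999995
    # eps = 1e-12  ->  about 0.99995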


By the way, this analysis explains one of my operational quirks: for  
a production environment I prefer the software running on the servers  
to be heterogeneous rather than homogeneous.  This is of course the  
opposite of what actual operations guys want, because it makes the  
whole farm of servers harder to mentally model and to manage, but  
hopefully this discussion explains its appeal: it reduces the  
correlation between software bugs present on different servers.  I  
think that the chance of a critical bug in the software is much  
higher than the chance of arson at the co-lo, and I suspect that the  
chance of critical software bugs may be high enough that it  
invalidates our five nines.


Note: this is not to denigrate the quality of the Tahoe software.   
The quality is actually much higher than the average for modern  
software, in large part due to the extensive unit tests and the  
attendant programming practices.  Unfortunately, while I'm confident  
in saying that the quality of Tahoe is much higher than the average  
for modern software, I'm still not comfortable saying that there is  
less than a 1 in 100,000 chance of a critical bug.  With further  
quality assurance processes such as code reviews, even more tests,  
operational experience, and "static" software (i.e., software that  
hasn't been changed in a long time or after a long series of quality  
assurance tests/reviews), I could become more confident in that  
number.  Certainly the clean, well-documented, well-tested, well- 
structured source code that currently makes up Tahoe is a much better  
candidate for that kind of process than most codebases.


Regards,

Zooko

[1] http://allmydata.org/pipermail/tahoe-dev/2007-September/000133.html
[2] http://bramcohen.livejournal.com/1416.html
[3] http://bramcohen.livejournal.com/1416.html?thread=6536#t6536
[4] http://www.econtalk.org/archives/2007/04/taleb_on_black.html


