[tahoe-dev] Kragen Sitaker, Brian Warner, Josh Wilcox, Bram Cohen, Russ Roberts, and Nassim Taleb on erasure coding probabilities
zooko at zooko.com
Wed Sep 26 17:56:51 UTC 2007
So I chatted with Kragen Sitaker on IRC last week, and asked him if
he knew of a way to get error bounds on
SUM for i=0..k ( choose(i, n) * P^i * (1-P)^(n-i) )
This is the formula for the reliability of a file, if the file is
encoded with K-out-of-N erasure coding, and each share has a
reliability of P.
Brian Warner has already calculated some approximations of this
formula for some K,N,P that we care about (see below), but I was
uncomfortable with those approximations because I didn't know their
error bounds. (Note that Josh Wilcox has already posted  a
recursive relation which might prove useful for computing this value
precisely using rationals, but I was curious if Kragen knew of
Kragen said something along the lines of: "This is assuming that the
reliability of the N servers are independent, right? I'm not sure
that this is a fair assumption."
I replied that this is a good point, and I mentioned how I rather
agreed with Bram Cohen's quip that "Engineering for more than six
nines is nothing but intellectual masturbation" . (Although
personally I'm more comfortable with five nines. I can engineer for
one-in-a-hundred-thousand events more confidently than for one-in-a-
Please see Bram's post  and Kragen's follow-up . Bram's reply
to Kragen is incomplete: Bram doesn't explain why servers with less
than 0.95 reliability can't be confidently analyzed, or why they are
more likely to have correlated failure than servers with 0.99
reliability, and those two assertions are not intuitively obvious to me.
Kragen said to me on IRC last week that Nassim Taleb has recently
written a good book arguing that this kind of modelling is wrong. I
haven't read those books yet, and I didn't understand Kragen's point
about this. Maybe Kragen wants to follow-up to this and explain?
Also maybe Bram could follow-up to this and explain his intriguing
assertions from his livejournal.
Here is a podcast in which economist Russ Roberts interviews Nassim
Taleb about his notion of "black swans" and "mediocristan vs.
extremistan". It's very interesting, and parts of it are quite
relevant to the current issue of how to model reliability: .
Currently I believe that we can reasonably use the above formula to
compute things with up to five nines. We can answer questions such as:
* If K=3 and N=10, then what is the server reliability P above
which the file reliability will be >= 0.99999?
I do not think we can correctly use this formula to answer some other
questions that we really want to answer, and that we are currently
using this formula to answer. See the otherwise excellent
provisioning tool that comes with Tahoe thanks to Brian Warner --
it's on the welcome page -- press the button labelled "Compute", and
see the output labelled "availability of ... all files in grid". The
value is an attempt to answer the question:
* If K=3 and N=10, and P=0.99, and there are 50,000,000 files, then
what is the probability that every one of the 50,000,000 files will
There are three reasons why the above formulation does not apply to
the latter question. First, the reliability of servers is not
independent. Second, the reliability of files is not independent.
Third, the accuracy of the solution to the per-file reliability is
unknown, and it is possible that the error in that number gives rise
to critical error in the computation of the for-all-files reliability.
So I currently believe that the answers that we have produced to the
second question (in the provisioning tool) are not justified and
should not be relied upon or offered as fact to customers/users/
Of course, using the formula to answer the first question also runs
afoul of the fact that server reliability is not independent. We can
resolve this contradiction by either:
1. Asserting that the chance of correlated failures such as arson at
a co-lo or a critical bug in the server software is so much less than
1 in 100,000 that it doesn't invalidate our answer to the first
2. Explicitly setting out those correlated failures in our answer,
e.g.: "Assuming that the only kind of failure is a server failure --
e.g. an IDE hard drive crash -- which is uncorrelated with the
failures of other servers, then the answer is BLAH." This means that
the answer BLAH is not valid in the event of fire, flood, war,
terrorism, act of god, penetration by a hacker, or mistake on the
part of the Tahoe programmers. There is a good reason that insurance
contracts come with that sort of language.
Explicitly setting out those causes is a legitimate approach to both
of the questions listed above. Asserting that the cumulative
probability of such causes is insignificant is not a legitimate
approach to the second question listed above, because to compute an
answer to the second question in the way that we currently do it
requires some intermediate values which have astronomical precision
in them, such as the "availability of ... arbitrary file" field in
the provisioning tool, which has fourteen nines!
By the way, this analysis explains one of my operational quirks --
for a production environment I prefer for the software running on the
servers to be heterogeneous rather than homogeneous. This is of
course the opposite of what actual operations guys want, because it
makes it harder for them to mentally model and to manage the whole
farm of servers, but hopefully this discussion explains the appeal of
it: it reduces the correlation between software bugs present in
different servers. I think that the chance of critical bugs in
software is much higher than the chance of arson at the co-lo, and I
suspect that the chance of critical bugs in software may be high
enough that it invalidates our five nines.
Note: this is not to denigrate the quality of the Tahoe software.
The quality is actually much higher than the average for modern
software, in large part due to the extensive unit tests and the
attendant programming practices. Unfortunately, while I'm confident
in saying that the quality of Tahoe is much higher than the average
for modern software, I'm still not comfortable saying that there is
less than a 1 in 100,000 chance of a critical bug. With further
quality assurance processes such as code reviews, even more tests,
operational experience, and "static" software (i.e., software that
hasn't been changed in a long time or after a long series of quality
assurance tests/reviews), then I could become more confident in that
number. Certainly the clean, well-documented, well-tested, well-
structured source code that currently makes up Tahoe is a much better
candidate for that kind of process than most codebases.
More information about the tahoe-dev