[tahoe-dev] erasure coding makes files more fragile, not less

Tue Mar 27 19:18:08 UTC 2012

(following up on my own post)

Tushar Chandra, Robert Griesemer, and Joshua Redstone from Google made
a very similar point in their widely cited paper "Paxos Made Live":

http://scholar.google.com/scholar?q=paxos+made+live&hl=en&btnG=Search&as_sdt=1%2C6&as_sdtp=on

Here's the relevant excerpt:

"""
In closing we point out a challenge that we faced in testing our
system for which we have no systematic solution. By their very nature,
fault-tolerant systems try to mask problems. Thus they can mask bugs
or configuration problems while insidiously lowering their own
fault-tolerance. For example, we have observed the following scenario.
We once started a system with five replicas, but misspelled the name
of one of the replicas in the initial group. The system appeared to
run correctly as the four correctly configured replicas were able to
make progress. Further, the fifth replica continuously ran in
catch-up mode and therefore appeared to run correctly as well. However
in this configuration the system only tolerates one faulty replica
instead of the expected two. We now have processes in place to detect
this particular type of problem. We have no way of knowing if there
are other bugs/misconfigurations that are masked by fault-tolerance.
"""
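
To make the arithmetic in that excerpt concrete, here is a minimal
sketch in Python. It assumes a plain majority-quorum rule (progress
needs more than half of the configured group) and treats the
misconfigured replica as a group member that can never take part in a
quorum. The function names are hypothetical, not taken from the paper:

```python
def majority(n):
    """Smallest number of replicas that forms a majority of an n-member group."""
    return n // 2 + 1


def additional_faults_tolerated(configured, working):
    """How many more working replicas can fail while a majority of the
    configured group is still available."""
    return max(working - majority(configured), 0)


# Five replicas configured, all five actually participating.
expected = additional_faults_tolerated(configured=5, working=5)

# Five configured, but one was misnamed and never really participates.
actual = additional_faults_tolerated(configured=5, working=4)

print("expected fault tolerance:", expected)
print("actual fault tolerance:  ", actual)
```

Under those assumptions the expected tolerance is 2 and the actual
tolerance is 1, yet the system makes progress in both cases, so nothing
visible distinguishes them unless something checks for it.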

My take-away is that the more powerful your fault-tolerance technology
is, the more powerful you need your monitoring technology to be. I
think Tahoe-LAFS as it currently exists ships with much more powerful
fault-tolerance than monitoring, which makes it dangerous unless the
user brings their own monitoring. A lot of the issue tickets with the
keyword "transparency" are about adding more built-in, automatic, and
user-visible monitoring:

https://tahoe-lafs.org/trac/tahoe-lafs/query?status=!closed&keywords=~transparency&order=priority
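
As a rough illustration of the kind of check such monitoring would
perform, here is a minimal Python sketch. It is not Tahoe-LAFS code:
the function name, the status labels, and the min_margin threshold are
all hypothetical, and it assumes a k-of-N erasure coding in which any
shares_needed of the shares_total shares suffice to read the file. The
point is to report the remaining safety margin rather than only
"readable" or "not readable":

```python
def check_file(shares_found, shares_needed, shares_total, min_margin):
    """Classify a file by how many further share losses it can survive.

    Returns (status, margin), where margin is the number of additional
    shares that can be lost before the file becomes unreadable.
    """
    margin = shares_found - shares_needed
    if margin < 0:
        status = "unrecoverable"   # fewer shares than needed to decode
    elif margin < min_margin:
        status = "at risk"         # still readable, but little slack left
    elif shares_found < shares_total:
        status = "degraded"        # readable, some redundancy already lost
    else:
        status = "healthy"         # every share accounted for
    return status, margin


# Illustrative 3-of-10 encoding with a hypothetical alert threshold of 2.
for found in (10, 8, 4, 2):
    status, margin = check_file(found, shares_needed=3,
                                shares_total=10, min_margin=2)
    readable = margin >= 0
    print(f"{found} shares: readable={readable} margin={margin} status={status}")
```

With 4 of 10 shares left, the file still reads back perfectly, which is
exactly the masking effect described above: only the margin (1 in that
case) reveals how close it is to being lost.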

Regards,

Zooko