[tahoe-dev] repairer work

Brian Warner warner-tahoe at allmydata.com
Fri Jul 11 21:56:02 UTC 2008


Zooko and I are working on Repairer stuff this month. Here's a general
overview of the approach we're taking:

 * enhance ImmutableFileNode and MutableFileNode (and DirectoryNode) to
   have a new check() method:

     node.check(verify=bool, repair=bool)

   This will perform a "file check" operation: ask all servers about the
   storage index and count the number of shares. If we see at least N
   shares, the file is "healthy", otherwise it is "unhealthy".

   If verify=True, then in addition to asking servers about (and believing
   their claims of) the existence of shares, we also download the full share
   and verify all of its hashes. This ought to validate every single bit of
   the share, giving us the chance to detect uncorrected/uncaught disk
   errors, memory errors, etc. Any integrity failure we see is a big deal for
   the operations staff: they need to examine the server in question and
   consider removing it from service. Using verify=True will take a
   considerable amount of disk IO and network bandwidth, equal to the work
   performed by the original upload.
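
   As a toy sketch of the "healthy" rule above (illustrative only, not the
   actual node interface; N is the total number of shares the encoder
   produces, 10 with the default 3-of-10 encoding):

     # hypothetical helper, just pinning down the health definition
     def classify_shares(share_numbers_seen, total_shares_N):
         # share_numbers_seen: set of distinct share numbers reported by
         # all servers for this storage index
         if len(share_numbers_seen) >= total_shares_N:
             return "healthy"
         return "unhealthy"

     classify_shares(set(range(10)), 10)      # -> "healthy"
     classify_shares(set([0, 2, 5, 9]), 10)   # -> "unhealthy": needs repair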

 * Also add a repair() method to those instances

   If repair() is called, or if check(repair=True) finds the file to be
   unhealthy, then an immediate repair operation is performed. This downloads
   the file, re-encodes it, and uploads enough new shares to bring the total
   number back up to N. (note that this will eventually interact with
   accounting: who gets charged for the new shares, the original uploader
   or the repairer? we're putting off this question until we get back to
   accounting)
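
   In outline, that repair step looks something like the following sketch
   (the callables passed in are stand-ins for the real peer-selection,
   download, FEC-encoding, and upload machinery; the names are made up):

     def repair_file(storage_index, readcap, k, N,
                     find_shares, download, encode, upload):
         # which of the N share numbers are still out there?
         present = find_shares(storage_index)     # e.g. set([0, 2, 5])
         missing = sorted(set(range(N)) - present)
         if not missing:
             return 0                  # all N shares present: nothing to do
         ciphertext = download(readcap)       # any k good shares suffice
         shares = encode(ciphertext, k, N)    # re-encode into all N shares
         for shnum in missing:                # upload only the missing ones
             upload(storage_index, shnum, shares[shnum])
         return len(missing)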

 * adding a ?t=deep-check-and-repair operation to the webapi

   Like the existing t=deep-size and t=deep-stats operations, this will
   recursively traverse all files and directories reachable from some
   starting point. But instead of merely adding up the size of the files
   encountered, t=deep-check-and-repair will perform a file check on
   everything by calling node.check(verify=False, repair=True); a usage
   sketch follows at the end of this item. A corresponding
   t=deep-verify-and-repair operation will use verify=True.

   My rough hunch is that deep-check will take about 6 times as long as
   deep-size/deep-stats, since it must examine immutable files in addition to
   dirnodes, and our prodnet grid currently has about 150k dirnodes and 800k
   files (950k objects to visit instead of 150k). My even rougher hunch is
   that deep-verify will take about 10-100 times as long as deep-check, since
   it must read every byte of every file.
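
   To make the webapi usage concrete, driving such a check from a script
   might look roughly like this (the host/port, quoting, and result format
   are assumptions; the details will get settled as the webapi grows):

     import urllib

     def deep_check_and_repair(webapi_root, rootcap):
         # e.g. webapi_root = "http://127.0.0.1:8123", rootcap = a dircap
         url = "%s/uri/%s?t=deep-check-and-repair" % (webapi_root,
                                                      urllib.quote(rootcap))
         return urllib.urlopen(url).read()   # some sort of per-file report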

 * start running t=deep-check-and-repair periodically

   AllMyData runs a t=deep-size operation on all customer rootcaps on a daily
   basis, to measure how much space each user is consuming. This operation
   currently takes 6 hours to complete (using a single webapi node cranking
   at 90% CPU, running at about 10 dirnodes per second, inducing perhaps a 5%
   CPU load on each storage server). We plan to switch this to
   t=deep-check-and-repair (which would emit the same stats as t=deep-stats,
   including total size), with careful instrumentation to see what files are
   in need of repair.

   Since we have not yet experienced any disk failures in the production
   grid, we wouldn't expect to see many files in need of repair: perhaps a
   few that were unlucky enough to be mid-upload when a storage server was
   bounced, but none due to share death. But we need to confirm this
   expectation.

   We'd like to do at least one run of deep-verify to measure how long it
   will take. The lower bound is the time necessary to read every byte out of
   a 1TB hard drive. With latencies and roundtrips and such, it will
   certainly take some multiple of that, perhaps 10x-100x. Running multiple
   webapi servers in parallel will help, up to the point that we consume all
   of the storage server's CPU or disk bandwidth (deep-size takes about 5% of
   a CPU right now, so I'm guessing we might be able to run at most 20 webapi
   nodes in parallel before hitting this limit; see the back-of-the-envelope
   sketch at the end of this item). However, we need to avoid using up all of
   the storage servers' bandwidth, to leave some for actual customer use.
   Throttling the checker/verifier may be necessary to accomplish this.

   But in the long run, I expect deep-verify to be too expensive (and the
   probability of undetected bit flips on a storage server disk too low), so
   I think we'll use periodic deep-check instead, perhaps once a week. Maybe
   daily deep-size and weekly deep-check.
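
   Back-of-the-envelope arithmetic for the numbers above (the inputs are
   just the estimates quoted in this message, so treat the outputs the same
   way):

     dirnodes = 150e3           # prodnet dirnodes
     files = 800e3              # prodnet immutable files
     dirnode_rate = 10.0        # objects/sec for today's single-webapi walk
     cpu_per_walker = 0.05      # ~5% storage-server CPU per webapi walker

     deep_size_hours = dirnodes / dirnode_rate / 3600
     # assume (naively) the same per-object cost once files are included
     deep_check_hours = (dirnodes + files) / dirnode_rate / 3600
     max_walkers = int(round(1.0 / cpu_per_walker))

     print "deep-size:  ~%.1f hours" % deep_size_hours   # ~4.2 (measured ~6)
     print "deep-check: ~%.1f hours" % deep_check_hours  # ~26 with one walker
     print "CPU-limited parallel walkers: ~%d" % max_walkers   # 20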

We need to build some reporting channels into this process: automatic repair
is the best policy to improve file health, but even more important is to find
out which disk is showing errors (because that will threaten lots of other
shares). I think we need something more reliable than having the user who ran
deep-check copy their error message out to an email: that user has no strong
incentive to report the problem to us. (And eventually we want every
download that discovers a problem to be able to trigger a repair.) So we're
thinking about a foolscap-based publish/subscribe scheme, in which nodes are
equipped with a "repairer.furl" or "problems.furl", and every time they see
a problem, they report it in a message to that object. We can then write
backend tools
that count the problems associated with any given server and give the ops
folks a heads up.
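
A rough sketch of the receiving end of such a scheme (the class, method
name, hostname, and port below are all made up for illustration; the real
interface would need to be pinned down in a foolscap RemoteInterface):

    from foolscap import Tub, Referenceable
    from twisted.internet import reactor

    class ProblemCollector(Referenceable):
        def remote_report_problem(self, server_id, storage_index, description):
            # stand-in backend: append to a flat file; real tools would
            # aggregate per-server counts and alert the ops folks
            f = open("problems.log", "a")
            f.write("%s %s %s\n" % (server_id, storage_index, description))
            f.close()

    tub = Tub()
    tub.startService()
    tub.listenOn("tcp:47806")
    tub.setLocation("collector.example.com:47806")
    problems_furl = tub.registerReference(ProblemCollector(), "problems")
    print problems_furl   # distribute this as each node's "problems.furl"
    reactor.run()

    # a node that notices a bad share would then do roughly:
    #   d = its_tub.getReference(problems_furl)
    #   d.addCallback(lambda rref: rref.callRemote("report_problem",
    #                                              server_id, si, "bad hash"))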

There are several other (cheaper) ways to go about checking/repair. Periodic
full-verify with automatic repair is the most complete, but also the most
expensive (in terms of bandwidth consumed: each full deep-verify run will
read every single bit in the entire grid). If we were willing to assume that
shares don't go breaking one-at-a-time on their own, then we could skip the
periodic check and instead just wait for a hard drive to die (an operator
would push the "server #3 died" button after seeing the disk errors pop out
on the kernel log). If we had a list of all the shares that were present on
that drive before it died (a non-trivial but solvable task), then we could
initiate repair work on that list. We could even maintain a big database of
which share was stored where, and use that to count the remaining shares:
then rather than repairing upon the first scratch, we could wait until the
file was actually in danger to do the repair (say, 5 shares left). This
"repair threshold" could be tuned to balance the amortization rate of repair
work (repairs per unit time) against the desired reliability (probability
that you'd lose the remaining necessary shares before repair finished). This
technique could cut our repair work by a factor of five relative to the
instant-repair model, since each repair would then be amortized over five
lost shares instead of one.
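
A sketch of the bookkeeping the share-location approach would need (the
in-memory table and the threshold value here are illustrative only; the real
thing would presumably be a proper database):

    # shares: dict mapping storage_index -> set of (server_id, share_number)
    REPAIR_THRESHOLD = 5    # repair once this few distinct shares remain

    def shares_to_repair(shares, dead_server_id):
        # return the storage indexes that are at or below the repair
        # threshold once dead_server_id's shares are written off
        needy = []
        for si, locations in shares.items():
            remaining = set(shnum for (server, shnum) in locations
                            if server != dead_server_id)
            if len(remaining) <= REPAIR_THRESHOLD:
                needy.append(si)
        return needy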

To support this, we'll eventually have an ImmutableFileVerifierNode and
MutableFileVerifierNode which provide check() but not download(), and are
created from a storage index (or from a verifier-cap). These objects would be
used for server-driven repair, where read-caps are not available.
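
Very roughly (the class shape below is just a sketch; names and constructor
arguments are not settled):

    class ImmutableFileVerifierNode:
        # built from a verifier-cap (or bare storage index): it can locate
        # and validate shares, but holds no decryption key, so no download()
        def __init__(self, verifier_cap):
            self.verifier_cap = verifier_cap

        def check(self, verify=False, repair=False):
            # locate shares, optionally fetch and validate all their hashes,
            # and kick off a repair if asked to and the file is unhealthy
            raise NotImplementedError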

Periodic check+repair has some interesting statistical modelling challenges.
I've put some work into jamming some numbers into this model (the
provisioning page contains some of the results), but I'm not particularly
confident about it yet. If there is a queueing theory / operations research
grad student out there who's looking for a thesis topic, I'll buy you lunch in
exchange for some analysis work :).
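
For example, one small piece of that model is the chance that a file which
currently has s shares left (k needed to recover) loses enough of them
during the repair window to become unrecoverable. A crude
independent-failure version, with the per-share failure probability p over
that window as an assumed input:

    def choose(n, r):
        # binomial coefficient n-choose-r, built up iteratively
        c = 1
        for i in range(r):
            c = c * (n - i) // (i + 1)
        return c

    def p_loss_before_repair(s, k, p):
        # probability that more than s-k of the s remaining shares fail
        # (independently, each with probability p), leaving fewer than k
        return sum(choose(s, j) * p**j * (1 - p)**(s - j)
                   for j in range(s - k + 1, s + 1))

    # e.g. 5 shares left, 3 needed, 1% per-share loss over the window:
    # p_loss_before_repair(5, 3, 0.01) is about 9.9e-06

The real problem is of course messier than this (correlated failures, repair
queue lengths, how long the window actually is), which is where the
queueing-theory help would come in.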

Anyways, I hope that gives folks an idea of where our repair work is heading.
Periodic-check plus instant-repair is the starting point, since once we have
that we know the data will be safe (even if the large-grid repair costs are
higher than we're comfortable with; it may be completely fine for smaller
friendnet grids). After that, we'll figure out ways to make it cheaper.

cheers,
 -Brian


