[tahoe-dev] [tahoe-lafs] #200: writing of shares is fragile and "tahoe stop" is unnecessarily harsh (was: writing of shares is fragile)

tahoe-lafs trac at allmydata.org
Mon Nov 2 08:16:25 UTC 2009

#200: writing of shares is fragile and "tahoe stop" is unnecessarily harsh
 Reporter:  zooko         |           Owner:  warner    
     Type:  enhancement   |          Status:  new       
 Priority:  major         |       Milestone:  eventually
Component:  code-storage  |         Version:  0.6.1     
 Keywords:  reliability   |   Launchpad_bug:            

Comment(by warner):

 Hrmph, I guess this is one of my hot buttons. Zooko and I have discussed
 the "crash-only" approach before, and I think we're still circling around
 each other's opinions. I currently feel that any approach that prefers an
 abrupt kill over a graceful shutdown is wrong. Intentionally killing the
 server with no warning whatsoever (i.e. the SIGKILL that "tahoe stop"
 does), when it is perfectly reasonable to provide some warning and
 tolerate a brief delay, is equal to intentionally causing data loss and
 damaging shares for the sake of some sort of ideological purity that I
 don't really understand.

 Be nice to your server! Don't shoot it in the head just to prove that you
 can. :-)

 Yes, sometimes the server will die abruptly. But it will be manually
 restarted far more frequently than that. Here's my list of
 running-to-not-running transition scenarios, in roughly increasing order
 of frequency:

  * kernel crash (some disk writes completed, in temporal order if you're
    lucky)
  * power loss (like kernel crash)
  * process crash / SIGSEGV (all disk writes completed)
  * kernel shutdown (process gets SIGINT, then SIGKILL, all disk writes
    completed and buffers flushed)
  * process shutdown (SIGINT, then SIGKILL: process can choose what to do,
    disk writes completed)

 The tradeoff is between:
  * performance in the good case
  * shutdown time in the "graceful shutdown" case
  * recovery time after something unexpected/rare happens
  * correctness: amount of corruption when something unexpected/rare
    happens (i.e. resistance to corruption: what is the probability that a
    share will survive intact?)
  * code complexity

 Modern disk filesystems effectively write a bunch of highly-correct,
 corruption-resistant, but poor-performance data to disk (i.e. the
 journal), then write a best-effort performance-improving index to very
 specific places (i.e. the inodes and dirnodes and free-block-tables and
 the rest). In the good case, the filesystem uses the index and gets high
 performance. In the bad case (the fsck that happens after it wakes up and
 learns that it didn't shut down gracefully), it spends a lot of time on
 recovery but maximizes correctness by using the journal. The shutdown
 time is pretty small but depends upon how much buffered data is waiting
 to be written (it tends to be insignificant for hard drives, but
 annoyingly long for removable USB drives).

 A modern filesystem could achieve its correctness goals purely by using
 the journal, with zero shutdown time (umount == poweroff). It would never
 spend any time recovering anything and would be completely "crash-only",
 but of course the performance would be so horrible that nobody would ever
 use it. Each open() or read() would involve a big fsck-like process, and
 it would have to keep the entire directory structure in RAM.

 So it's an engineering tradeoff. In Tahoe, we've got a layer of
 redundancy over and above the individual storage servers, which lets us
 deprioritize the per-server correctness/corruption-resistance goal a
 little bit.

 If correctness were infinitely important, we'd write out each new version
 of a mutable share to a separate file, then do an fsync(), then perform an
 atomic rename (except on platforms that are too stupid to provide such a
 feature, of course), then do fsync() again, to maximize the period of time
 during which the disk contains a valid monotonically-increasing version of
 the share.

 If performance or code complexity were infinitely important, we'd modify
 the share in-place with as few writes and syscalls as possible, and leave
 the flushing up to the filesystem and kernel, to do at the most efficient
 time.

 If performance and correctness were top goals, but not code complexity,
 you could imagine writing out a journal of mutable share updates, and
 somehow replaying it on restart if we didn't see the "clean" bit that
 means we'd finished doing all updates before shutdown.

 So anyways, those are my feelings in the abstract. As for the specifics,
 I strongly feel that "tahoe stop" should be changed to send SIGINT and
 give the process a few seconds to finish any mutable-file-modification
 operation it was doing before sending it SIGKILL. (As far as I'm
 concerned, the only reason to ever send SIGKILL is that you're impatient
 and don't want to wait for the process to clean up, possibly because you
 believe it has hung or stopped making progress, and you can't or don't
 wish to look at the logs to find out what's going on.)

 I don't yet have an informed opinion about copy-before-write or
 edit-in-place. As Zooko points out, it would be appropriate to measure the
 costs of writing out a new copy of each share, and see how bad it looks.

  * the simplest way to implement copy-before-write would be to first copy
    the entire share, then apply in-place edits to the new version, then
    atomically rename it into place. We'd want to consider a recovery-like
    scan for abandoned editing files (i.e.
    {{{find storage/shares -name *.tmp |xargs rm}}}) at startup, to avoid
    unbounded accumulation of those tempfiles, except that such a scan
    would be expensive to perform and would rarely find anything.

  * another option is to make a backup copy of the entire share, apply
    in-place edits to the *old* version, then delete the backup (and
    provide a recovery procedure that looks for backup copies and uses
    them to restore the presumably-incompletely-edited original). This
    would be easier to implement if the backup copies were all placed in
    a single central directory, so the recovery process can scan for them
    quickly, perhaps at every startup.

 However, my suspicion is that edit-in-place is the appropriate tradeoff,
 because that will lead to simpler code (i.e. fewer bugs) and better
 performance, while only making us vulnerable to share corruption during
 rare events that don't give the server time to finish its write() calls
 (i.e. kernel crash, power loss, and SIGKILL). Similarly, I suspect that
 it is not appropriate to call fsync(), because we would lose performance
 everywhere but only improve correctness in the kernel crash and power
 loss scenarios. (A graceful kernel shutdown, or an arbitrary process
 shutdown followed by enough time for the kernel/filesystem to flush its
 buffers, would allow all write()s to be flushed even without a single
 fsync() call.)

Ticket URL: <http://allmydata.org/trac/tahoe/ticket/200#comment:5>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

More information about the tahoe-dev mailing list