[tahoe-dev] mutable file and directory safety in allmydata.org "tahoe" 0.9

Brian Warner warner-tahoe at allmydata.com
Wed Mar 12 16:45:39 UTC 2008

Zooko: thanks for the excellent summary!

> 1.  Fix the ?t=uri and ?t=set_children methods and the DELETE method  
> to check the directory's version number, do delta application of the  
> requested change, upload the new version of the directory, catch  
> UncoordinatedWriteError, re-download, re-apply the requested change,  
> upload, etc. until the upload succeeds.  (To understand more about  
> this idiom, read on.)

I propose to fix this (today) in the following way:

 implement IMutableFileNode.update() and IMutableFileNode.replace().
 replace() is an explicit overwrite (and identical to our current replace
 method): you use this when you don't care about the old contents of the
 file, for example when you've uploaded a file with the "Mutable?" checkbox
 turned on, and now you want to modify it in place. (note that we have only
 limited webapi/cli support for this operation right now).

 update() is used by "delta" operations (in which you expect to make a
 modification to an existing data structure that is stored in the mutfile).
 All dirnode modifications will use delta operations. These should be done
 as follows:

  mutfile = client.create_node_from_uri(u)
  d = mutfile.download_to_data()
  d.addCallback(apply_delta)  # compute the new contents from the old
  d.addCallback(mutfile.update)
  return d

 This will raise UncoordinatedWriteError if someone else touches the mutable
 file between the start of the download_to_data() call and the completion of
 the update() call. In this case the caller (e.g. dirnode.add_child)
 can choose to return the failure to their own caller (i.e. tell the user
 that they violated the Prime Coordination Directive), or they may choose to
 retry after a suitable randomized delay:

  mutfile = client.create_node_from_uri(u)
  d = mutfile.download_to_data()
  d.addCallback(apply_delta)
  d.addCallback(mutfile.update)
  def _retry(f):
      # wait a randomized delay, then re-download, re-apply, re-update
      d2 = defer.Deferred()
      reactor.callLater(random.uniform(5, 20), d2.callback, None)
      d2.addCallback(lambda res: mutfile.download_to_data())
      d2.addCallback(apply_delta)
      d2.addCallback(mutfile.update)
      return d2
  d.addErrback(_retry)
  return d

 Enthusiastic callers may wish to retry multiple times, using an exponential
 backoff algorithm. Such callers may also wish to give up eventually.
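
 A retry schedule with exponential backoff and a cap on the number of
 attempts could be sketched like this (the base delay, growth factor, and
 try limit here are illustrative assumptions, not tahoe defaults):

```python
import random

# Illustrative backoff helper -- base delay, factor, jitter, and the
# try cap are assumptions for this sketch, not tahoe defaults.
def backoff_delays(base=5.0, factor=2.0, jitter=0.5, max_tries=5):
    """Yield one randomized delay per retry attempt, growing
    exponentially, so a caller can give up after max_tries."""
    delay = base
    for _ in range(max_tries):
        yield delay * random.uniform(1.0 - jitter, 1.0 + jitter)
        delay *= factor
```

 A caller would pass each yielded delay to reactor.callLater before the
 next attempt, and give up once the generator is exhausted.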

 Note that callers are not obligated to retry: the addition of retry code
 will merely reduce the frequency of UCW errors that are exposed to the
 higher-level caller.

Note that if a UCW is returned, the state of the file is indeterminate. The
present codebase (which lacks #272 recovery) will always leave some mixture
of shares (a mix of all simultaneous writers), which is not as healthy as we
would like: we would prefer all 'N' shares to be for the same version. This
only threatens recoverability if network delays cause writers to deliver
their changes in a non-linear order, and if there are a lot of simultaneous
writers (given our 3-of-10 encoding, it takes 4 uncoordinated writers to
clobber the file). A successful write that occurs later will restore the file
to full health. I do not believe that multiple UCWs in a row will compound
the problems: unless the writers are interrupted in the middle of their
update messages, all shares should get replaced by newer versions.
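
The 4-writer figure follows from share arithmetic: the file is unrecoverable
only if every version (each writer's new version plus the pre-existing one)
ends up with fewer than k=3 of the 10 shares. A back-of-the-envelope check
(my own illustration of that reasoning, not tahoe code):

```python
def can_clobber(n_shares=10, k=3, writers=4):
    """True if `writers` simultaneous uncoordinated writers, plus the
    pre-existing version, could leave every version of the file with
    fewer than k shares (i.e. no version recoverable).  This models a
    worst-case split of shares among competing versions."""
    versions = writers + 1  # the old version's shares also compete
    return versions * (k - 1) >= n_shares
```

With 3 writers there are 4 competing versions and at most 8 shares can be
"wasted" at 2 apiece, so some version must reach 3 shares and survive; with
4 writers the 10 shares can be split 2-2-2-2-2 and nothing survives.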

Once we implement #272, then as long as there is at least one surviving
writer (i.e. not all writers lose their network connection in the middle of
the update), the UCW will leave the file in a healthy state (all N shares for
the same version). The actual version is still indeterminate: it might be the
old one, it might be the new one that you tried to upload, it might be
somebody else's new one. That's the penalty for violating the coordination
directive.

I recommend that the dirnode delta operations (add_children() and friends)
*not* attempt to perform retry at this time. We can make writes safe from
this blind overwrite bug by implementing update(), but continue to treat UCW
as a user error and not feel an urgent need to protect the user from it. I
believe that UCW will be rare enough for the next month that we don't need to
go out of our way to hide them.

> We already know of ways that, depending on how many writers and how  
> often they write, and depending on when the network connections or  
> the clients or the servers crash, that using Tahoe that way can  
> silently lose data.

To be clear, the kind of data loss that we've been worrying about (caused by
having so many simultaneous writers that we wind up with fewer than 'k'
shares per version) is not silent. All of the writers involved will deliver a
UCW error to their callers. Well, ok, assuming we implement #272 recovery,
the remaining threat is when all of those writers die during the publish
process. So I suppose it could be called "anthropomorphically silent": an
error is raised, but nobody who has seen it gets left alive :-).

> Argh.  Folks: I just went to implement "robust application of  
> set_children", as per #1 above, and discovered *two* previously  
> unknown ways that multiple uncoordinated writes to a directory can  
> cause silent data loss.

Could you describe these two new problems?

> and for users (i.e. allmydata.com) to make sure that they don't so  
> that.  I will talk to Mike Booker to be sure, but I'm pretty sure  
> that allmydata.com can easily enough avoid uncoordinated writes in  
> the Allmydata 3.0 product, simply by having few or no shared- 
> writeable directories, or by creating a simple centralized lock  
> server when necessary.

For the benefit of the non-allmydata folks: we haven't yet implemented
directory sharing in the .com product (and when we do, we're planning to use
directed pairwise one-reader-one-writer directories, which don't suffer
from this concern because they don't give a write-cap to the recipient). So
the main concern right now is a user who has an automated backup process
writing a lot of data into a directory at the same time that they are using
a web browser (on a different tahoe node) to modify those backup directories.

As long as we continue in this approach (i.e. *not* taking advantage of
tahoe's easily-shareable directory capabilities), then a per-account lock
(respected by both the FUSE plugin and all web frontends) will be sufficient
to completely avoid UCWs.
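
A minimal sketch of such a per-account lock, assuming the FUSE plugin and
the web frontend can see a common filesystem (when they run on different
machines, the centralized lock server Zooko mentioned would take its place;
this class is hypothetical, not an existing tahoe interface):

```python
import os

class AccountLock:
    """Hypothetical advisory per-account lock built on an O_EXCL
    lockfile.  Every frontend would acquire() this before modifying
    any of the account's directories, and release() afterwards."""
    def __init__(self, path):
        self.path = path
    def acquire(self):
        # O_CREAT|O_EXCL fails atomically if the lockfile already exists
        fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
    def release(self):
        os.remove(self.path)
```

A second acquire() while the lockfile exists raises OSError, which the
caller can treat the same way as a UCW: back off and retry.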

> P.S.  Someday someone might search history for instances of the term  
> "LAUGFS", which stands for "Least AUthority Grid File System".   
> Hello, there, searcher from the future!

Or maybe even "laughfs" :). FYI, "tahoefs", "tahoe filesystem", "allmydata
tahoe", "tahoe python" all put us in the top two results on google.
Unsurprisingly, "tahoe storage" and "tahoe protocol" do not.