[tahoe-dev] UncoordinatedWriteError, accidentally migrated shares

Fri Jan 11 05:37:37 UTC 2008

[cc'ed to tahoe-dev because it might be affecting other people]

Peter: I finally understand the issue with that directory that you've been
having problems with. It is currently unwritable, because of two problems.

The first is described in ticket #269: I didn't pay attention to the recent
move from node.pem to private/node.pem, and when I upgraded the testnet
storage servers (about a week ago), I failed to move the node.pem file. As a
result, most of the storage servers created new ones, giving them new
nodeids. As a result of *that*, most of the write_enablers were invalidated,
which will prevent clients from modifying their mutable slots. Therefore most
of the shares of your directory are now stuck at their current versions.
Eventually we'll implement the share-migration tools to let clients recover
from this sort of thing on their own, but we haven't done that yet.
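
To make the first problem concrete, here is a rough sketch of why a new
nodeid strands the old shares. It is a toy model, not the real storage
server code: the hash construction, the ToyServer class, and the slot names
are all made up for illustration. The only thing it assumes from our design
is that a write_enabler is bound to both a per-file secret and the server's
nodeid, and that the share remembers the one it was created with.

```python
import hashlib
import hmac


def write_enabler(file_secret: bytes, nodeid: bytes) -> bytes:
    """Hypothetical derivation: bind a per-file secret to one server's nodeid."""
    return hashlib.sha256(b"write_enabler:" + file_secret + b":" + nodeid).digest()


class ToyServer:
    """Toy storage server: remembers the enabler presented at slot creation."""

    def __init__(self, nodeid: bytes):
        self.nodeid = nodeid
        self.slots = {}  # slot name -> (stored enabler, share data)

    def write(self, slot: str, enabler: bytes, data: bytes) -> bool:
        stored = self.slots.get(slot)
        if stored is not None and not hmac.compare_digest(stored[0], enabler):
            return False  # wrong enabler: refuse to modify the share
        self.slots[slot] = (enabler, data)
        return True


def demo() -> tuple:
    secret = b"per-file secret"
    server = ToyServer(nodeid=b"old-nodeid")
    created = server.write("dir", write_enabler(secret, server.nodeid), b"v1")
    # The upgrade loses node.pem: the server comes back with a new nodeid,
    # but the share on disk still holds the enabler made for the old one.
    server.nodeid = b"new-nodeid"
    updated = server.write("dir", write_enabler(secret, server.nodeid), b"v2")
    return created, updated, server.slots["dir"][1]
```

In this model the first write succeeds and the second is refused, so the
share stays at its old contents even though the client holds the right
per-file secret.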

The second is the lack of recovery code in our mutable slot implementation.
This sort of version skew between shares can happen normally (outside of a
bug like #269), such as when a client or server is interrupted during a write
operation. If this happens in such a way that a small number of new shares
are written (not enough to reconstruct the new version of the file), then all
further attempts to modify the file will fail with an
UncoordinatedWriteError. This is the exception that you were seeing. I've
created ticket #272 to describe this one. I've also added a bunch of new
logging (using foolscap's new logging tools) to capture information about the
problem: the issue was a lot easier to see once those tools were in place.
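
The failure mode can be modelled in a few lines. This is only a sketch of
the condition described above, not our actual mutable-file code: the
function names, the seqnum-per-share dictionary, and the k parameter (shares
needed to reconstruct a version) are assumptions for illustration, and the
exception class is a local stand-in for the real one.

```python
from collections import Counter


class UncoordinatedWriteError(Exception):
    """Local stand-in for the exception named above."""


def recoverable_versions(share_seqnums: dict, k: int) -> list:
    """Return the sequence numbers that have at least k shares."""
    counts = Counter(share_seqnums.values())
    return sorted(seq for seq, n in counts.items() if n >= k)


def check_before_write(share_seqnums: dict, k: int) -> int:
    """Return the newest seqnum if it is recoverable, else refuse to write."""
    if not share_seqnums:
        raise ValueError("no shares found")
    newest = max(share_seqnums.values())
    if newest not in recoverable_versions(share_seqnums, k):
        # A few shares carry a newer version than we can reconstruct.
        # Without recovery code, all we can do is give up.
        raise UncoordinatedWriteError(
            "newest version %d has too few shares" % newest
        )
    return newest


# An interrupted write: 2 shares reached seqnum 5, the other 8 stayed at 4.
skewed = {0: 5, 1: 5}
skewed.update({sharenum: 4 for sharenum in range(2, 10)})
```

With k=3, the skewed map still has a perfectly readable version 4, but
check_before_write(skewed, 3) raises because version 5 exists and cannot be
reconstructed. Recovery code would need to notice that case and write a
fresh version on top of it instead of failing.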

We've done the design work for both share migration and mutable-file
recovery, as described in docs/mutable.txt. We just haven't gotten around to
implementing them.

So at the moment, my best advice is to create a new directory and use it
instead. We might want to consider deleting all the shares created before my
bungled upgrade, since until we write that migration code, those shares are
like immutable roadblocks. Even if we fixed the shares, we'd still need to
finish the #272 task (i.e. implement mutable-file recovery) before we could
salvage that directory. I think we should do #272 before release; I'm less
worried about share-migration because I think it will be a while before we
need to do it in a production network. On the other hand, share-migration is
easier to write than the recovery code.

cheers,
 -Brian