[tahoe-dev] Sighting reports

Tue Jul 13 21:10:19 UTC 2010

At 2010-07-13 15:36 (-0400), Kyle Markley wrote:

> Hey developers,
> 
> I've been putting my 4-node grid through some stress and I've encountered
> a few problems I wanted to report.
> 
> 1) Sometimes I get backup operations failing like this:

The part of Tahoe-LAFS that is failing is the part that figures out
where shares should be placed. It asks each storage server whether it
will hold some of the shares that will be generated when the file is
encoded. The storage server checks its available space to make sure
that it can hold all of the shares that it is asked to hold, and refuses
any shares that it does not have space for. It then tells the client
which shares it will hold, and which it is already holding.
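
As a rough illustration of the exchange described above, here is a
minimal Python sketch. The class, method, and parameter names are
hypothetical and are not taken from the Tahoe-LAFS source; it only
assumes that every requested share has the same size.

```python
class ToyStorageServer:
    """Hypothetical stand-in for a storage server (not Tahoe-LAFS code)."""

    def __init__(self, free_bytes, held=()):
        self.free_bytes = free_bytes   # space available for new shares
        self.held = set(held)          # share numbers already stored

    def allocate(self, wanted, share_size):
        """Return (already_have, will_hold) for the requested share numbers."""
        already_have = set()
        will_hold = set()
        for shnum in sorted(wanted):
            if shnum in self.held:
                # Nothing to allocate: the share is already here.
                already_have.add(shnum)
            elif self.free_bytes >= share_size:
                # Reserve the space now so later requests see it as used.
                self.free_bytes -= share_size
                will_hold.add(shnum)
            # Otherwise the share is refused: it appears in neither set.
        return already_have, will_hold


server = ToyStorageServer(free_bytes=150, held={0})
already_have, will_hold = server.allocate({0, 1, 2}, share_size=100)
print(already_have, will_hold)  # prints {0} {1}
```

In this toy run the server already holds share 0, has room for share 1,
and refuses share 2. A client that sees a refusal like this could report
the server as "full" even though it still had 150 bytes free beforehand.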

The upload code in the client concluded that your storage server was
full because the storage server refused to hold one or more of the
shares that the client asked it to hold. This doesn't necessarily mean
that the storage server is actually full, only that it is unable to
accept a share of the size that your upload would generate. (So maybe
that error message should be reworded to say "of which 1 placed none due
to the server not having enough free space", or something like that.)

(I have opened bug #1116 [1] for the error message)

From the error message, and from a message you sent to the list before,
I gather that you're using 2-of-4 encoding. Is that right? If so, each
share generated from a particular source file will be about half the
size of the source file. Does this happen with any particular files? If
not, and if you notice it happening again, compare the source file size
to the amount of free space available to Tahoe-LAFS on your storage
servers -- if one of the servers has less free space available to
Tahoe-LAFS than about half the size of the source file, then the storage
server is probably right to reject the share, and the client is probably
right to abort the upload.
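
The comparison suggested above can be sketched in a few lines of Python.
This is only a back-of-the-envelope check: the function names are
hypothetical, and the estimate assumes a share is roughly the file size
divided by k (2 for 2-of-4 encoding), ignoring any per-share overhead.

```python
def approx_share_size(file_size, k):
    """Approximate size of one share for k-of-N encoding (overhead ignored)."""
    return -(-file_size // k)  # ceiling division


def servers_too_small(file_size, k, free_by_server):
    """Return the names of servers whose free space is below one share."""
    needed = approx_share_size(file_size, k)
    return [name for name, free in sorted(free_by_server.items())
            if free < needed]


# Hypothetical numbers: a 1000-byte file with 2-of-4 encoding.
print(approx_share_size(1000, 2))                           # prints 500
print(servers_too_small(1000, 2, {"a": 600, "b": 400}))     # prints ['b']
```

If a server shows up in that list, its refusal is expected rather than
a bug.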

> This error report is incorrect -- all of the storage nodes show on their
> status pages that they are still accepting new shares!  Further, I've seen
> that if I keep trying to restart the backup, the storage situation degrades
> until eventually it says that all 4 shares couldn't be placed due to the
> server being full.  If I restart the tahoe node trying to run the backup,
> this problem goes away, at least for a while.

When a storage server accepts responsibility for a share during peer
selection, it creates a placeholder file of the same size as the share
it was asked to store. This means that the new share is counted in
future space accounting even if it hasn't been written yet.
Unfortunately, it seems that when peer selection fails, the client
doesn't tell the storage server that it won't be using the space that it
allocated earlier, so the storage server fills up a little bit every
time you try and fail to upload. Restarting the Tahoe-LAFS node makes
uploads work again because (I think) the storage server notices that
you've disconnected and deletes the unused share files without being
told to.

This is a bug, and I've opened #1117 [2] to fix it. 
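
To make the leak concrete, here is a toy Python model of the behaviour
described above. It is a sketch under assumptions, not the actual
storage server: the class, its methods, and the byte counts are all
hypothetical, and it assumes (as guessed above) that placeholders are
reclaimed when the client disconnects.

```python
class ToyServer:
    """Hypothetical model of placeholder accounting (not Tahoe-LAFS code)."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes
        self.placeholders = []  # list of (client_id, size) reservations

    def allocate(self, client_id, size):
        """Reserve space for one share; return False if it does not fit."""
        if size > self.free_bytes:
            return False
        self.free_bytes -= size
        self.placeholders.append((client_id, size))
        return True

    def client_disconnected(self, client_id):
        """Drop every reservation made by this client and reclaim the space."""
        reclaimed = sum(s for cid, s in self.placeholders if cid == client_id)
        self.placeholders = [(cid, s) for cid, s in self.placeholders
                             if cid != client_id]
        self.free_bytes += reclaimed


server = ToyServer(free_bytes=250)
# Three upload attempts that each reserve a 100-byte share and then fail
# elsewhere without ever releasing the reservation.
print([server.allocate("client-1", 100) for _ in range(3)])
# prints [True, True, False]
server.client_disconnected("client-1")   # e.g. the client node restarts
print(server.free_bytes)                 # prints 250
```

Each failed attempt leaves a reservation behind, so the third attempt is
refused even though no share data was ever written.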

> 2) A long tahoe backup aborted with this error:
[...]
> assert len(buckets) == sum([len(peer.buckets) for peer in used_peers])

I've opened #1118 [3] to examine this issue. I think that assert isn't
quite right, since it doesn't consider the possibility that we might
have allocated more buckets than we intend to use.
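
One way the two sides of that assert could disagree is sketched below.
This is an assumption for illustration only: it supposes that `buckets`
is a dict keyed by share number built by merging each peer's buckets,
which the quoted traceback does not show. `ToyPeer` and `bucket_counts`
are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for a peer tracker: an id plus a dict mapping
# share number -> bucket allocated on that peer.
ToyPeer = namedtuple("ToyPeer", ["peerid", "buckets"])


def bucket_counts(used_peers):
    """Return (len(merged buckets), total buckets allocated across peers)."""
    merged = {}
    for peer in used_peers:
        # A later peer's bucket replaces an earlier one with the same key.
        merged.update(peer.buckets)
    total = sum(len(peer.buckets) for peer in used_peers)
    return len(merged), total


peers = [
    ToyPeer("A", {0: "A-0", 1: "A-1"}),
    ToyPeer("B", {1: "B-1", 2: "B-2"}),  # share 1 was allocated twice
]
print(bucket_counts(peers))  # prints (3, 4)
```

Under that assumption, an extra allocation of share 1 makes the merged
count smaller than the per-peer total, and an equality assert on the two
would fail.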

Thanks for the reports,
-- 
Kevan Carstensen | <kevan at isnotajoke.com>

[1] http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1116
[2] http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1117
[3] http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1118