out of shares error

Greg Troxel gdt at ir.bbn.com
Sat Mar 22 23:58:23 UTC 2014


"Zooko O'Whielacronx" <zookog at gmail.com> writes:

> On Sat, Mar 22, 2014 at 12:54 PM, Greg Troxel <gdt at ir.bbn.com> wrote:
>>
>> I just rebuilt tahoe-lafs from pkgsrc with updated py-OpenSSL and the
>> newly vast set of dependencies for it.   So far everything seems to be
>> ok.  I ran a deep-check --add-lease over an alias with about 5500
>> objects (and I believe all is healthy).
>>
>> After a long time of 'NN00 objects checked', I got:
>>
>> 5500 objects checked..
>> ERROR: NotEnoughSharesError(ran out of shares: complete=
>>     pending=Share(sh9-on-[redacted]),Share(sh3-on-[redacted]) overdue=
>>     unused= need 3. Last failure: None)
>> "[Failure instance: Traceback (failure with no frames): <class
>>     'allmydata.interfaces.NotEnoughSharesError'>: ran out of shares:
>>     complete= pending=Share(sh9-on-[redacted]),Share(sh3-on-[redacted])
>>     overdue= unused= need 3. Last failure: None"
>>
>> and I'm wondering if that's running out of memory in the tahoe process,
>> vs a remote filesystem error.
>
> Hrm, this is confusing. Is it saying that it finished checking all
> 5500, and then… *after* it finished checking them it reported this
> error?
>
> Could you run a "deep-check" operation in the WUI instead of the CLI
> and see if that gives more understandable results?

There are actually 5617 objects, so it failed near the end, but not
quite at it.  I did another run with the CLI and the -v option, on both
netbsd/i386 (with memory limited to a 2G data size) and osx/x86_64, and
both completed fine.

As a nit, it seems that the "100 objects checked" line is printed before
the -v line saying the 100th object was healthy, or else there's a
fencepost error (99 'healthy' lines appear before "100 objects checked").

This error is in 

./src/allmydata/immutable/downloader/fetcher.py:
            format = ("ran out of shares: complete=%(complete)s"

and it seems to be a transient failure from not getting shares that were
expected to be available.  I wasn't able to absorb the logic well enough
to really understand it.
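My best guess at the give-up condition, pieced together from the fields
in the error message (an illustrative simplification, not the real
fetcher logic), is something like:

    # Illustrative simplification of the give-up condition, inferred
    # from the error-message fields; not the real fetcher code.
    def must_give_up(complete, pending, overdue, need, can_start_more):
        if can_start_more:
            return False
        # even if every pending/overdue request succeeded, we could
        # not reach the k (= need) shares required to decode
        return len(complete) + len(pending) + len(overdue) < need

    # the reported state: complete empty, sh9 and sh3 pending, need 3
    print(must_give_up([], ["sh9", "sh3"], [], 3, can_start_more=False))  # True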

So the only real bug is that the failure should probably be confined to
that one file (an EIO, in effect), and the deep-check/repair process
should not exit (although perhaps that was the last file to be checked).
Certainly a Python exception should not be displayed to the user, but
that's a general style issue.
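What I'd expect instead is something along these lines (a sketch only;
check_one() is a made-up name, not a tahoe API):

    # Sketch of per-object error handling for the traversal: confine
    # the failure to that object and keep going.  check_one() is a
    # stand-in, not an actual tahoe API.
    from allmydata.interfaces import NotEnoughSharesError

    def deep_check_all(objects, check_one):
        failures = []
        for obj in objects:
            try:
                check_one(obj)
            except NotEnoughSharesError as e:
                # the equivalent of EIO on a single file: record it and
                # continue, instead of aborting the whole run with a
                # raw Python traceback
                failures.append((obj, str(e)))
        return failures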

This makes me wonder about the old NFS hard-vs-soft-mount question, too.
But generally that's a no-win situation, and the only real solution is
to have a reliable network.

My server situation is 4 servers with 3-of-10 encoding; most files have
their shares placed 4/3/3 over 3 of the servers, with the fourth server
empty (it had a disk failure, I repaired with only the other 3, and then
added the fourth one's disk back).
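For what it's worth, a quick back-of-the-envelope check of that layout
says any single server can be down and at least 6 of the 10 shares
remain, well above k=3, so the fetcher having only two shares in flight
really does look like a transient fetch problem rather than missing
shares:

    # Back-of-the-envelope check of the 4/3/3/0 share layout above.
    shares = {"server1": 4, "server2": 3, "server3": 3, "server4": 0}
    k = 3  # 3-of-10 encoding

    for down in sorted(shares):
        remaining = sum(count for name, count in shares.items() if name != down)
        print("%s down: %d of 10 shares still reachable (need %d)" % (down, remaining, k))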