[tahoe-dev] Manual rebalancing in 1.10.0?

Kyle Markley kyle at arbyte.us
Sat Sep 28 21:17:12 UTC 2013


Reading the text of #1382, it isn't clear to me whether it's expected to 
address the error from "tahoe put" at all.  (It only mentions check, 
verify, and repair.)  And although it mentions repair, in my scenario 
all the shares are present on the grid, so the file doesn't need 
"repair" at all... it *only* needs rebalancing.

Should I file my scenario in a new ticket?  Or is it actually intended 
to be covered by #1382?
Did I test the new code from #1382 correctly?
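
In the meantime, here is how I'm telling "needs repair" apart from 
"needs rebalancing only" on my grid.  This is just a minimal sketch 
against the webapi (it assumes the default gateway at 127.0.0.1:3456, 
and the cap is a placeholder), reading the same JSON that "tahoe check 
--raw" prints, quoted in full below:

# Minimal sketch, not part of Tahoe: ask the webapi for check results
# and report whether repair or only rebalancing is called for.
# Assumes the default gateway port; CAP below is a placeholder.
import json, urllib, urllib2   # Python 2.7, as in the version list below

CAP = "URI:CHK:..."            # hypothetical; substitute a real cap
url = ("http://127.0.0.1:3456/uri/%s?t=check&output=JSON"
       % urllib.quote(CAP, safe=""))
r = json.load(urllib2.urlopen(url, ""))["results"]   # t=check is a POST

if r["count-shares-good"] < r["count-shares-expected"]:
    print "shares are missing: the file needs repair"
elif r["needs-rebalancing"]:
    print "all shares present but badly placed: rebalancing only"
else:
    print "healthy and balanced"

In my case "count-shares-good" equals "count-shares-expected" and 
"healthy" is true, so nothing is missing; only "needs-rebalancing" is 
true.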


On 09/22/13 09:46, Kyle Markley wrote:
> Mark Berger, et al,
>
> I (believe I have) tried my scenario with your code, and it doesn't 
> fix the behavior I have been seeing.
>
> Given a file on the grid for which all shares exist, but which needs 
> rebalancing, "tahoe put" for that same file will fail.  (And "tahoe 
> check --repair" does not attempt to rebalance.)
>
> This is what I did.  I'm a git novice, so maybe I didn't get the right 
> code:
> $ git clone https://github.com/markberger/tahoe-lafs.git
> $ cd tahoe-lafs/
> $ git checkout 1382-rewrite
> Branch 1382-rewrite set up to track remote branch 1382-rewrite from 
> origin.
> Switched to a new branch '1382-rewrite'
> $ python setup.py build
> <snip>
> $ bin/tahoe --version
> allmydata-tahoe: 1.10b1.post68 [1382-rewrite: 
> 7b95f937089d59b595dfe5e85d2d81ec36d5cf9d]
> foolscap: 0.6.4
> pycryptopp: 0.6.0.1206569328141510525648634803928199668821045408958
> zfec: 1.4.24
> Twisted: 13.0.0
> Nevow: 0.10.0
> zope.interface: unknown
> python: 2.7.3
> platform: OpenBSD-5.3-amd64-64bit
> pyOpenSSL: 0.13
> simplejson: 3.3.0
> pycrypto: 2.6
> pyasn1: 0.1.7
> mock: 1.0.1
> setuptools: 0.6c16dev4
>
>
> Original output from tahoe check --raw:
> {
>  "results": {
>   "needs-rebalancing": true,
>   "count-unrecoverable-versions": 0,
>   "count-good-share-hosts": 2,
>   "count-shares-good": 10,
>   "count-corrupt-shares": 0,
>   "list-corrupt-shares": [],
>   "count-shares-expected": 10,
>   "healthy": true,
>   "count-shares-needed": 4,
>   "sharemap": {
>    "0": [
>     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
>    ],
>    "1": [
>     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
>    ],
>    "2": [
>     "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa"
>    ],
>    "3": [
>     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
>    ],
>    "4": [
>     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
>    ],
>    "5": [
>     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
>    ],
>    "6": [
>     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
>    ],
>    "7": [
>     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
>    ],
>    "8": [
>     "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa"
>    ],
>    "9": [
>     "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya"
>    ]
>   },
>   "count-recoverable-versions": 1,
>   "count-wrong-shares": 0,
>   "servers-responding": [
>    "v0-ylkbcys5oqliy26d6s6kuwk5nmw5ktlcxmx254dfprm4rwrojhya",
>    "v0-7ags2kynskk5rrmbyk6yzjzmceswxh7x5lekghwsfbwdpfeaztxa",
>    "v0-jqs2izy4yo2wusmsso2mzkfqpqrmmbhegtxcyup7heisfrf4octa",
>    "v0-rbwrud2e6alixe4xwlaynv7jbzvhn2wxbs4jniqlgu6wd5sk724q"
>   ],
>   "recoverable": true
>  },
>  "storage-index": "rfomclj5ogk434v2gchspipv3i",
>  "summary": "Healthy"
> }
>
>
> Then I try to re-upload the unbalanced file:
> $ bin/tahoe put /tmp/temp_file
>
> Error: 500 Internal Server Error
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py", line 677, in _done
>     self.request.complete(res)
>   File "/usr/local/lib/python2.7/site-packages/foolscap-0.6.4-py2.7.egg/foolscap/call.py", line 60, in complete
>     self.deferred.callback(res)
>   File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.3-amd64.egg/twisted/internet/defer.py", line 380, in callback
>     self._startRunCallbacks(result)
>   File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.3-amd64.egg/twisted/internet/defer.py", line 488, in _startRunCallbacks
>     self._runCallbacks()
> --- <exception caught here> ---
>   File "/usr/local/lib/python2.7/site-packages/Twisted-13.0.0-py2.7-openbsd-5.3-amd64.egg/twisted/internet/defer.py", line 575, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 604, in _got_response
>     return self._loop()
>   File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 455, in _loop
>     return self._failed("%s (%s)" % (failmsg, self._get_progress_message()))
>   File "/usr/local/lib/python2.7/site-packages/allmydata/immutable/upload.py", line 617, in _failed
>     raise UploadUnhappinessError(msg)
> allmydata.interfaces.UploadUnhappinessError: shares could be placed or 
> found on 4 server(s), but they are not spread out evenly enough to 
> ensure that any 4 of these servers would have enough shares to recover 
> the file. We were asked to place shares on at least 4 servers such 
> that any 4 of them have enough shares to recover the file. (placed all 
> 10 shares, want to place shares on at least 4 servers such that any 4 
> of them have enough shares to recover the file, sent 4 queries to 4 
> servers, 4 queries placed some shares, 0 placed none (of which 0 
> placed none due to the server being full and 0 placed none due to an 
> error))
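
As I understand it, the "not spread out evenly enough" test is the 
servers-of-happiness count: the size of a maximum matching between 
shares and the servers that hold them.  Here is a toy sketch of that 
idea (my own code, not Tahoe's actual implementation), fed the 
sharemap from the check output above with the server ids abbreviated:

# Toy sketch: servers-of-happiness as a maximum bipartite matching,
# computed with Kuhn's augmenting-path algorithm.
def happiness(sharemap):
    matched = {}   # server -> the share currently matched to it

    def augment(share, seen):
        for server in sharemap[share]:
            if server not in seen:
                seen.add(server)
                if server not in matched or augment(matched[server], seen):
                    matched[server] = share
                    return True
        return False

    return sum(augment(share, set()) for share in sharemap)

# The sharemap reported above: 8 shares on one server, 2 on another.
sharemap = {
    0: ["ylkb"], 1: ["ylkb"], 2: ["7ags"], 3: ["ylkb"], 4: ["ylkb"],
    5: ["ylkb"], 6: ["ylkb"], 7: ["ylkb"], 8: ["7ags"], 9: ["ylkb"],
}
print happiness(sharemap)   # 2

Only two distinct servers hold shares, so the matching can never 
exceed 2, and shares.happy = 4 is unsatisfiable no matter how many 
shares exist, which is consistent with the failure above.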
>
>
>
>
>
> On 09/17/13 08:45, Kyle Markley wrote:
>> It would be my pleasure.  But I won't have time to do it until the 
>> weekend.
>>
>> It might be faster, and all-around better, to create a unit test that 
>> exercises the scenario in my original message.  Then my buildbot 
>> (which has way more free time than I do) can try it for me.
>>
>> Incidentally, I understand how I created that scenario.  The machine 
>> that had all the shares is always on, and runs deep-check --repair 
>> crons.  My other machines aren't reliably on the grid, so after 
>> repeated repair operations, the always-on machine tends to get a lot 
>> of shares.  Eventually, it accumulated shares.needed shares, and then 
>> a repair happened while it was the only machine on the grid.  Because 
>> repair doesn't care about shares.happy, this machine got all 
>> shares.total shares.  Then, because an upload cares about shares.happy 
>> but won't rebalance, it had to fail.
>>
>> A grid whose nodes don't have similar uptime is surprisingly 
>> fragile.  Failure of that single always-on machine would make the 
>> file totally unretrievable, which is definitely not the desired 
>> behavior.
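
A toy model of the feedback loop I described above (my own sketch, 
nothing from the Tahoe code base): repair re-creates any share whose 
holders are all offline and places the copy on whatever happens to be 
online, with no shares.happy constraint, so the always-on box steadily 
accumulates a copy of everything.

# Toy model: nightly repair on a grid where only one box is always on.
import random

TOTAL = 10                                   # shares.total
servers = ["always-on", "laptop", "desktop", "htpc"]
holders = {i: set([random.choice(servers)]) for i in range(TOTAL)}

for night in range(30):                      # nightly deep-check --repair
    up = set(["always-on"])
    up |= set(random.sample(servers[1:], random.randint(0, 3)))
    for share, where in holders.items():
        if not (where & up):                 # no holder reachable tonight
            where.add(random.choice(sorted(up)))   # repair re-places it

print sum(1 for w in holders.values() if "always-on" in w), "of", TOTAL
# Nearly always "10 of 10": once a share lands on the always-on box it
# is never re-placed again, so that box ends up holding everything.

And once that happens, "tahoe put" of the same file must fail, because 
upload enforces shares.happy but never moves a share.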
>>
>>
>>
>> On 09/16/13 09:57, Zooko O'Whielacronx wrote:
>>> Dear Kyle:
>>>
>>> Could you try Mark Berger's #1382 patch on your home grid and tell us
>>> if it fixes the problem?
>>>
>>> https://tahoe-lafs.org/trac/tahoe-lafs/ticket/1382 (immutable peer
>>> selection refactoring and enhancements)
>>>
>>> https://github.com/tahoe-lafs/tahoe-lafs/pull/60
>>>
>>> Regards,
>>>
>>> Zooko
>>
>>
>
>


-- 
Kyle Markley



