[tahoe-dev] [tahoe-lafs] #778: "shares of happiness" is the wrong measure; "servers of happiness" is better

tahoe-lafs trac at allmydata.org
Fri Sep 25 01:38:21 UTC 2009


#778: "shares of happiness" is the wrong measure; "servers of happiness" is
better
--------------------------------+-------------------------------------------
 Reporter:  zooko               |           Owner:  kevan
     Type:  defect              |          Status:  new  
 Priority:  critical            |       Milestone:  1.5.1
Component:  code-peerselection  |         Version:  1.4.1
 Keywords:  reliability         |   Launchpad_bug:       
--------------------------------+-------------------------------------------

Comment(by kevan):

 The case you describe, if I understand it correctly, is captured by this
 example:

 Suppose I have four servers, {{{s_1}}} through {{{s_4}}}. Suppose I have
 {{{k = 3, h = 3, N=10}}}. Suppose that the shares generated from a file f
 are numbered {{{f_n}}} (so {{{f_1}}} is share 1, and so on). Then the
 situation you describe is a problem if shares are distributed as follows:

   * {{{s_1}}}: {{{f_1}}}
   * {{{s_2}}}: {{{f_1}}}
   * {{{s_3}}}: {{{f_1}}}
   * {{{s_4}}}: {{{f_2, f_3, f_4, f_5, f_6, f_7, f_8, f_9, f_10}}}

 because the set {{{{s_1, s_2, s_3}}}} is not sufficient to recover the
 file, though it is the same size as {{{k}}}.
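
 To make the arithmetic concrete, here's a tiny illustrative check (not
 Tahoe code; the server names and share numbers are just the ones from the
 example above) of whether a set of servers holds at least {{{k}}} distinct
 shares:

{{{
# Illustrative only -- not Tahoe-LAFS code.
def recoverable(servers, shares_on, k):
    """Return True if the given servers collectively hold at least k
    distinct shares, i.e. enough to reconstruct the file."""
    distinct = set()
    for s in servers:
        distinct.update(shares_on[s])
    return len(distinct) >= k

shares_on = {
    "s_1": {1}, "s_2": {1}, "s_3": {1},
    "s_4": {2, 3, 4, 5, 6, 7, 8, 9, 10},
}
print(recoverable(["s_1", "s_2", "s_3"], shares_on, k=3))  # False
print(recoverable(["s_1", "s_2", "s_4"], shares_on, k=3))  # True
}}}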

 Do you think we're on the same page?

 From what I understand of the placement algorithm, Tahoe-LAFS wouldn't
 normally distribute one share to more than one server.

 [source:src/allmydata/immutable/upload.py at 4045#L225 _loop] in
 {{{Tahoe2PeerSelector}}} contains the guts of the peer selection
 algorithm. As I understand it, it only ever has one request for a
 particular share out to peers at a time -- either because it uses
 {{{pop()}}} to remove the share from the homeless shares list, or because
 it replaces the slice of the homeless shares list that is currently out
 on the wire with the empty list. If this is the case, then it shouldn't
 allocate one share to more than one server.
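
 (Here is a rough sketch of that invariant with made-up names, not the
 real {{{Tahoe2PeerSelector}}} structures: a share is either in the
 homeless list or out in exactly one pending query, never both.)

{{{
# Illustrative sketch only -- not the real Tahoe2PeerSelector._loop.
# The point is the invariant: popping a share off the homeless list
# means no second request for it can be issued while one is pending.
def next_query(uncontacted_peers, homeless_shares):
    peer = uncontacted_peers.pop(0)    # contact each peer at most once
    if not homeless_shares:
        return peer, set()
    share = homeless_shares.pop()      # no longer homeless, so no other
                                       # query can pick it up
    return peer, set([share])

peers = ["s_1", "s_2", "s_3", "s_4"]
homeless = [1, 2, 3, 4, 5]
print(next_query(peers, homeless))     # ('s_1', {5}): share 5 is in flight
print(homeless)                        # [1, 2, 3, 4]
}}}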

 A more interesting case is if, for some reason, we attempt to upload
 {{{f}}} to a grid that has the following layout:

   * {{{s_1}}}: {{{f_1}}}
   * {{{s_2}}}: {{{f_1}}}
   * {{{s_3}}}: {{{f_1}}}
   * {{{s_4}}}: {{{\emptyset}}}

 I'll also stipulate that {{{s_1, s_2}}} and {{{s_3}}} will not accept any
 more shares, while {{{s_4}}} will accept every share it is asked to
 accept. Let's walk through what happens when I try to upload {{{f}}}.

 I'll assume for simplicity that {{{_loop}}} starts with
     {{{
     self.homeless_shares = [f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8, f_9, f_10]
     self.uncontacted_peers = [s_1, s_2, s_3, s_4]
     }}}

 {{{_loop}}} will start by checking to see if there are any homeless
 shares. There are, of course. It then checks to see if there are any
 uncontacted peers. At this point in the execution, there are, so it pops
 the first uncontacted peer ({{{s_1}}}) off of the list of uncontacted
 peers, and pops the first homeless share ({{{f_1}}}) off of the list of
 homeless shares. It then asks {{{s_1}}} to store {{{f_1}}}. Recall that
 {{{s_1}}} already has {{{f_1}}}. Since the {{{StorageServer}}}
 ([source:src/allmydata/storage/server.py at 3841#L35]) tells remote callers
 about all of the shares it has for a given storage index, {{{s_1}}} tells
 the {{{Tahoe2PeerSelector}}} instance that it has {{{f_1}}}, and that it
 has not allocated any shares. This is handled by the {{{_got_response}}}
 method of {{{Tahoe2PeerSelector}}}, which sees that {{{s_1}}} has not
 allocated anything, but that it already has {{{f_1}}}: that means that
 {{{f_1}}} is not added to the homeless list again, and that {{{s_1}}} is
 recorded as the preexisting peer holding {{{f_1}}}.

 {{{_loop}}} will now ask {{{s_2}}} if it can store {{{f_2}}} in the same
 way. {{{s_2}}} will reply (much as {{{s_1}}} did) that it can't (i.e.,
 that it hasn't allocated {{{f_2}}}), but that it happens to have
 {{{f_1}}}. This causes {{{_got_response}}} to record {{{s_2}}} as the
 preexisting peer holding {{{f_1}}}, overwriting the previously recorded
 value ({{{s_1}}}).

 The same process occurs with {{{s_3}}}: the end result of the
 {{{_loop/_got_response}}} combination in that case is that {{{s_3}}} is
 now recorded as the preexisting peer holding {{{f_1}}}.
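
 (Here's a toy version of that bookkeeping, not the real
 {{{_got_response}}}, just to show why only one server is ever remembered
 per share number:)

{{{
# Toy version of the bookkeeping described above, not the real
# _got_response: a dict maps share number -> the single peer most
# recently reported as already holding that share.
preexisting_shares = {}

def record_response(peer, alreadygot, allocated, homeless_shares):
    for shnum in alreadygot:
        preexisting_shares[shnum] = peer   # overwrites any earlier entry
        if shnum in homeless_shares:
            homeless_shares.remove(shnum)  # don't try to place it again
    # `allocated` shares would be recorded in use_peers; omitted here.

homeless = [2, 3, 4, 5, 6, 7, 8, 9, 10]    # f_1 was popped for the s_1 query
record_response("s_1", alreadygot={1}, allocated=set(), homeless_shares=homeless)
record_response("s_2", alreadygot={1}, allocated=set(), homeless_shares=homeless)
record_response("s_3", alreadygot={1}, allocated=set(), homeless_shares=homeless)
print(preexisting_shares)                  # {1: 's_3'} -- only the last one
}}}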

 Finally, {{{_loop}}} asks the last uncontacted peer, {{{s_4}}}, to store
 {{{f_2}}}. {{{s_4}}} replies that it doesn't have any preexisting shares
 for the storage index, and that it has allocated {{{f_2}}}. {{{s_4}}} is
 recorded in {{{self.use_peers}}} as having done this.

 There are now no more uncontacted peers, so the {{{Tahoe2PeerSelector}}}
 instance will attempt to store a proportional amount of the homeless
 shares (see [source:src/allmydata/immutable/upload.py at 4045#L263]) on each
 of the remaining servers. {{{s_1, s_2}}}, and {{{s_3}}} will refuse to
 accept any of these, and will not be used in any further queries as a
 result. {{{s_4}}} will accept all of the shares that
 {{{Tahoe2PeerSelector}}} wants to put on it, so it will be re-appended to
 the list of peers to consider. On the next pass, {{{_loop}}} will attempt
 to store all currently homeless shares on {{{s_4}}}, and will succeed.
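
 (Roughly, the batching in that second pass has the shape of the sketch
 below; the real code in {{{upload.py}}} computes the batch size per peer
 and issues the queries asynchronously.)

{{{
import math

# Rough sketch of the proportional batching described above; not the
# actual upload.py logic, just the shape of it.
def next_batch(homeless_shares, num_remaining_peers):
    size = int(math.ceil(len(homeless_shares) / float(num_remaining_peers)))
    return homeless_shares[:size], homeless_shares[size:]

# In the example, s_4 is the only peer still willing to accept shares,
# so it gets asked for everything that is still homeless.
batch, rest = next_batch([3, 4, 5, 6, 7, 8, 9, 10], num_remaining_peers=1)
print(batch)   # [3, 4, 5, 6, 7, 8, 9, 10]
print(rest)    # []
}}}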

 {{{Tahoe2PeerSelector}}} now has no more shares to allocate, and executes
 the comparison that Zooko mentions. To get {{{servers_with_shares}}},
 {{{self._servers_with_shares}}} takes the union of the set of peers that
 we're actually uploading shares to and the set of peers that we've
 recorded as already having shares. The latter set is built using the
 preexisting-shares structure that is updated each time a response is
 received from {{{s_1}}}, {{{s_2}}}, or {{{s_3}}}: this will never have
 more than one server for {{{f_1}}} (or any {{{f_n}}}). The only other
 server in this list will be {{{s_4}}}, since it has all of the other
 shares. {{{_servers_with_shares}}}, then, returns a list of two servers,
 and {{{Tahoe2PeerSelector}}} correctly fails, because 2 is less than 3.
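
 (In other words, the final check amounts to something like the following
 sketch, with the numbers from the example plugged in; it's a
 simplification, not the real {{{_servers_with_shares}}}.)

{{{
# Simplified sketch of the final comparison described above.
def servers_with_shares(use_peers, preexisting_shares):
    # union of the peers we are uploading to and the peers recorded as
    # already holding shares
    return set(use_peers) | set(preexisting_shares.values())

use_peers = {"s_4"}                # accepted all of the remaining shares
preexisting_shares = {1: "s_3"}    # last server seen already holding f_1
servers = servers_with_shares(use_peers, preexisting_shares)
print(len(servers))                # 2 -- less than the required 3, so the
                                   # upload correctly fails
}}}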

 One example is obviously not proof that the peer selection algorithm
 guards against this in all cases, though, so maybe an additional test (or
 revising that one) would be a good idea. Did you have anything in
 particular in mind that would be a better indicator?

 This is perhaps off-topic, but if I understand the execution of
 {{{_got_response}}}, we don't count a query as having made progress (or
 as being a good query) unless the remote server has either allocated
 shares for us, or told us that it already has shares that we think are
 homeless. This means that if I ask a server to store one share, and it
 happens to already have that share (but no others), its response isn't
 considered successful, because the share I asked it to store is no longer
 in homeless shares (we popped it off, destructively modifying homeless
 shares, when we sent the request), and nothing was allocated. Is this
 intentional?
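
 (Concretely, the behaviour I'm describing would look something like this
 toy check -- again, not a copy of the real code:)

{{{
# Toy version of the "did this query make progress?" check described
# above; not a copy of _got_response.
def query_made_progress(alreadygot, allocated, homeless_shares):
    progress = False
    for shnum in alreadygot:
        if shnum in homeless_shares:
            progress = True    # a share we thought was homeless has a home
    if allocated:
        progress = True        # the server accepted something new
    return progress

# The corner case: we asked the server for share 1, so share 1 has already
# been popped out of homeless_shares; the server holds share 1 and nothing
# else, and allocates nothing -- so the query counts as unsuccessful.
print(query_made_progress(alreadygot={1}, allocated=set(),
                          homeless_shares=[2, 3, 4]))    # False
}}}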

 (whew, that was a lot longer than I was planning on it being :-/)

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/778#comment:52>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

