[tahoe-dev] [tahoe-lafs] #599: maybe add share-metadata: "where-are-the-other-shares" hints

Wed Jan 6 20:59:30 UTC 2010

#599: maybe add share-metadata: "where-are-the-other-shares" hints
--------------------------+-------------------------------------------------
 Reporter:  warner        |           Owner:           
     Type:  enhancement   |          Status:  new      
 Priority:  major         |       Milestone:  undecided
Component:  code-storage  |         Version:  1.2.0    
 Keywords:  download      |   Launchpad_bug:           
--------------------------+-------------------------------------------------

Comment(by warner):

 I was thinking more about this last night, in the context of #872, #447,
 #573, and other enhanced peer-selection goals. Many of these tickets
 express
 a desire to improve certain properties of the distributed files, typically
 reliability (by improving geographical diversity), reduced repair cost
 (getting k+1 shares into each datacenter), increased download speed
 (placing
 several shares near each downloader), and increased fairness of
 reliability
 over time (by trying to fill all servers at the same time instead of at
 the
 same rate, flattening out the entropy(peer-selection-process)-vs-time
 curve
 as described in comment:ticket:872:4).

 All of these goals are hard to accomplish right now, because we want small
 fixed readcaps for each file, and that doesn't leave the downloader with a
 lot of information available to do the matching peer-selection algorithm.
 The
 fact that readcaps are immutable means that the repairer/rebalancer (which
 need to add shares to new places, and sometimes remove them from old ones)
 are changing the serverlist, and yet the downloader must be able to find
 the
 new shares efficiently using the old (fixed) readcap. Even if the file
 isn't
 repaired and all shares exist in the same places, the system needs to be
 tolerant of server churn.

 The Tahoe2 selection algorithm provides small fixed readcaps, handles
 share
 movement due to repair (assuming the repairer uses Tahoe2 too), and has
 reasonable tolerance of server churn (adding 10% new servers will increase
 the downloader's search distance by an average of 10%). And it provides
 fill-at-same-rate load-balancing behavior, which is easy to understand and
 happens to give you fill-at-same-time behavior if you have homogeneous
 servers. But it is hard to accomodate any of the other goals contained in
 the
 tickets above, or to trade off some goals against others.

 "All problems in computer science can be solved by introducing another
 layer
 of abstraction". In the context of this ticket, we could accomplish more
 of
 these goals simultaneously by introducing an intermediate
 where-are-the-shares table. This table would contain a mutable mapping
 from
 storage index to:

  * verifycap
  * share locations: list of (serverid, confidence level)
  * other serverlist locations: list of serverids

 The serverid strings would merely need to be short prefixes of the full
 serverids, perhaps as little as 4 bytes, since they would be expanded by
 looking for matches in the full serverlist, and infrequent collisions
 would
 merely increase the search distance slightly. (this approach, rather than
 including a full FURL, trades off space against continued dependence on an
 Introducer and/or a notion of enumerable grid membership).

 Each storage server would offer access to "location slots" in this table,
 just like they currently offer mutable-share slots. However, the slots
 contain plaintext, and would use have writecaps and readcaps. Instead, the
 remote operations would look like:

  * allocate location slot
   * later, set storage index and verifycap (once only)
  * propose locations (serverid1, serverid2, ...)
  * propose removal (serverid1, serverid2, ...)
  * fetch share location list
  * fetch other serverlist location list

 The storage servers would be responsible for confirming the presence or
 absence of shares themselves. To this end, each table entry has a
 "confidence
 level", which the server uses to keep track of how much validation it has
 performed. When a "propose location" message arrives, the serverids are
 added
 with the lowest confidence level (0). The server sends do-you-have-share
 messages to those servers, and if they claim "YES", the level is raised to
 1.
 Later, when bandwidth allows, the server can fetch a couple of blocks at
 random, and the hash chains, and make sure everything verifies up to the
 locally-held verifycap, and then raises the level to 2. Eventually, the
 server can fetch and verify the entire share, and raise the level to 3.

 Upon receipt of a "propose removal" message, the server marks the entry as
 "possibly removed", and queries the indicated server directly. If and only
 if
 the server response to the DYHB with a "NO", then the entry is actually
 removed.

 The server would need to run a service to manage this verification
 process,
 and to allocate a certain amount of bandwidth to it. This manager should
 also
 query the other serverlist holders, to update each other on the current
 share
 locations.

 The idea is that shares would be uploaded to completely arbitrary
 locations,
 to accomplish whatever placement goals were desired (load balancing,
 download
 speed, datacenter diversity, etc). Then the location slots would be
 allocated
 on well-known servers (using the normal Tahoe2 algorithm), and filled with
 the locations of the shares. The location slot holders would tell each
 other
 about the shares to flood this information out to all slots.

 A downloader would use Tahoe2 to find at least one location slot, read the
 serverlist out of it, and then query those servers for shares. They could
 read from multiple location slots if they somehow got out of sync. The
 likely
 implementation would be to send "fetch share location list" messages in
 parallel to the first N servers in the permuted list, and upon receipt of
 the
 first positive response, send DYHB messages in parallel to everything it
 mentions.

 After the repairer uploaded new shares, it would locate the location slots
 and inform them about the new share. Through flooding, eventually all
 location slots would learn about the new share. When a share is removed
 from
 a server (via lease expiration, or mutable rebalancing), the "propose
 removal" message would be sent, and the location slot would confirm.

 If the repairer notices that there are too few location slots, it can
 allocate and fill new ones. It would be appropriate to have "N" location
 slots. Since we're using 1-of-N encoding here (pure replication), the
 reliability of the location slots should be much higher than the
 reliability
 of the target file, and the extra layer of indirection should not
 significantly reduce reliability.

 Having the location slot servers perform their own verification removes
 the
 need for an extra layer of writecaps (which would be needed by the
 repairer
 to update the location list, making it harder to have storage servers
 drive
 the repair process themselves). However it does increase their workload
 significantly, making this look more like the old Tahoe1 scheme as
 mentioned
 above. On the other hand, this could provide the basis for server-driven
 repair, where their periodic check on each other could provoke repair of
 unhealthy files without client involvement.

 What will prevent us from wanting to manipulate the choice of location-
 slot
 holders in the same way that those tickets want to manipulate the choice
 of
 share holders? Location slots are very small, so we'd be unlikely to want
 to
 balance the storage load as we do with (large) shares. They'd use 1-of-N
 encoding, so certain performance and reliability concerns are reduced. But
 for the others, we'd just have to hope that providing arbitrary location
 of
 shares is enough to get folks to tolerate these fixed location of
 location-slot holders.

 There would be a nontrivial performance hit to use location-slots.
 Uploaders
 would require one extra RTT, because the "propose location" messages could
 not be sent until all shares had been closed. Servers would spend
 considerable bandwidth and CPU checking up on each others shares. And
 downloaders would incur an extra RTT to query the location-slot holders in
 addition to the existing DYHB queries, however the DYHB queries would be
 highly likely to succeed, making the combination potentially faster in
 certain not-ideally-Tahoe2-distributed server selection schemes.

 Other ideas:

  * use a limited subset of servers for the location slots (i.e. nodes
 might
    offer share-storage, or location-slot-storage, but not both)
  * alternatively, try to put the location-slots on the same servers as the
    shares, so they share fate and we avoid the reliability hit

 Anyways, this feels like a lot of moving parts (and the server load
 worries
 me), but a layer like this would let us use arbitrary server-selection
 policies. It would also help us scale to millions of storage nodes, if the
 Tahoe2 permutation and frequent queries only needed to be done on a few
 hundred location-slot servers instead of the entire set of storage
 servers.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/599#comment:1>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid