[tahoe-dev] detecting weak uploads and share rebalancing Re: Share rebalancing
zooko at zooko.com
Mon Oct 19 21:36:44 UTC 2009
N.B. there are four different arguments about issue #302 in this
letter; don't mix them up.
On Monday,2009-10-19, at 13:34 , Brian Warner wrote:
> .. and we'd suffer from the "lumpy distribution" problems that are
> discussed in ticket #302, where servers get unequal load depending
> upon where they sit in the ring, and where servers who become full
> can dump inordinate amounts of traffic on the poor node just
> clockwise from them.
... [reordering quotes from your message]
> I still believe that permuted-ring gives us better overall
> behavior. I'm willing to be proved wrong, though :).
1. I wrote a simulation which convinced me that this is wrong --
that both share placement algorithms have an indistinguishable (and
highly varying) pattern of servers filling up. However, the results
that I posted to tahoe-dev were confusing and hard to follow, and you
seem to have ignored them. I see that I didn't link them in from
#302 either. I should go find that letter in tahoe-dev archives and
link to it from #302. Here it is: http://allmydata.org/pipermail/
> And, an attacker who took out several neighboring-in-id-space
> servers would kill or seriously damage several files (if they took
> out N-k+1 consecutive servers in a non-trivially utilized grid,
> they'd be guaranteed to kill some files).
[snip more interesting consequences of placement strategy on an
attacker who attacks many servers]
2. Neat! I hadn't thought of this malicious case before. Perhaps
you could add a link from #302 to your letter about the malicious case.
3. We were already familiar with designs like Chord (and Chord File
System) and Kademlia when you (Brian) came up with the "permute per
fileid" trick. It should be seen as an (arguable) improvement on
those designs. However, Chord and Kademlia have been deployed with
success, sometimes on a massive scale -- e.g. Cassandra DB  and
Vuze  -- where load-balancing is also an issue. This suggests
that either this phenomenon isn't a problem in many situations in
practice (which would be consistent with my simulation -- argument 1)
or that the designers of Cassandra DB and Vuze ought to think about
adopting the permute-per-fileid trick (or both).
In fact, Cassandra's unique appeal among "post-relational" (a.k.a.
"nosql") databases is that it supports range queries, and the way it
does so relies upon the "natural" chord ordering. If you're familiar
with Chord, you can think of Cassandra as being a lot like Chord plus
the added feature of *not* running the keys through a secure hash to
load-balance them. This is an "improvement" on Chord in the opposite
direction of our improvement! :-) It makes the load-balancing
properties much *worse* than the standard Chord load-balancing.
Assuming anyone actually uses Cassandra in this mode, then this
demonstrates that the sort of "balancing tools" discussed in e.g. 
are usable to some people.
4. This thread started because Shawn Willden needed to do some
mucking about with his shares, and the permute-per-fileid feature
makes it harder for him to muck his shares. This is a real live
example of my argument (in e.g. http://allmydata.org/pipermail/tahoe-
dev/2008-July/000672.html ) that the simpler placement strategy can
help people administer their Tahoe-LAFS deployments. I need to link
to this thread from #302 and claim that this is an example.
Of all these four arguments, I think argument 4 is the most important.
I think the next steps here are to document the arguments better on
ticket #302 and also to create a new separate ticket which is all
about making per-file-permutation optional (leaving #302 as the
repository of arguments about which is better in what situations,
whether the default should be changed, etc.).
tickets mentioned in this email:
http://allmydata.org/trac/tahoe/ticket/302 # stop permuting peerlist,
use SI as offset into ring instead?
 http://wiki.apache.org/cassandra/PoweredBy # says Cassandra is
used for inbox search at Facebook, which is up to 40 TB of data
across 120 machines in two separate data centers
 http://vanish.cs.washington.edu/pubs/usenixsec09-geambasu.pdf #
says that Vuze has a million nodes at a time, spanning the globe
More information about the tahoe-dev