[tahoe-dev] Load balancing

Sun Aug 30 02:13:23 UTC 2009

Aaron Cordova wrote:

> Beyond repair to replenish the number of desired shares when machines
> go missing, does Tahoe do any creation/replication of shares to
> balance the load when new machines are added to the storage grid?

Not really. If the Repairer is invoked, and there are new servers around
which weren't part of the original upload set, then the current repair
process will put shares on them. We're looking at improving the
repairer/uploader to be more aware of existing shares (i.e. once you see
the first pre-existing share, ask around to find the others before you
commit to the placement sharemap), in #362. It's an open question as to
whether this new uploader ought to move existing shares, or leave them
alone. Moving them is a bandwidth hit, but it gets you rebalancing; the
catch is that you need more code to delete the old shares (or, more
likely, to cancel their leases and wait for expiration).

If we treat shares not being in their ideal locations (i.e. not at the
top of the permuted serverlist) as "damage" (albeit a rather
low-priority form), and invoke the repairer on them, then moving them to
their ideal locations would provide the sort of adapt-to-new-servers
load-balancing you're talking about. #699 covers some of this.
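
To make the "permuted serverlist" idea concrete, here is a minimal
sketch, not Tahoe's actual code. It assumes the permutation is a sort of
server ids by SHA-256 of (storage index + server id), and that for N
shares the top N servers of that list are the ideal homes. The hash
choice and the names permuted_servers and misplaced_shares are
illustrative assumptions.

```python
import hashlib


def permuted_servers(storage_index, server_ids):
    # Hypothetical permutation: sort servers by SHA-256(storage_index + id).
    # The hash used by a real client may differ.
    return sorted(
        server_ids,
        key=lambda sid: hashlib.sha256(storage_index + sid).digest(),
    )


def misplaced_shares(storage_index, server_ids, sharemap):
    # sharemap: {share_number: server_id currently holding that share}.
    # Simplification: the ideal homes are the top-N servers for N shares.
    ideal = set(permuted_servers(storage_index, server_ids)[:len(sharemap)])
    return {num: sid for num, sid in sharemap.items() if sid not in ideal}


if __name__ == "__main__":
    servers = [b"server-%d" % i for i in range(6)]
    si = b"example-storage-index"
    order = permuted_servers(si, servers)
    # Shares 0 and 1 sit on the top servers; share 2 sits on the last one.
    sharemap = {0: order[0], 1: order[1], 2: order[-1]}
    print(misplaced_shares(si, servers, sharemap))
```

Anything this reports would be the low-priority "damage" that a
rebalancing repair could act on.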

Of course, this happens all the time for mutable files, since every
update may witness a change to the serverlist, and the choice I made on
that was to try to leave the shares alone. #232 is about changing that
to auto-rebalancing, which I think would be a good idea. The mutable
storage-server protocol already has a command to delete the share (since
mutable writers can delete the contents anyways), so we can update the
client code to implement a rebalancing move with copy-then-delete, and
not have to change the servers at all.
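
A client-side copy-then-delete move might look roughly like the sketch
below. FakeMutableServer is a hypothetical in-memory stand-in, and its
read_share/write_share/delete_share methods are invented for
illustration; they are not the real storage-server protocol. The point
is only the ordering: copy, confirm the copy, then delete the original.

```python
class FakeMutableServer:
    """In-memory stand-in for a storage server (hypothetical interface)."""

    def __init__(self):
        self.shares = {}

    def write_share(self, storage_index, sharenum, data):
        self.shares[(storage_index, sharenum)] = data

    def read_share(self, storage_index, sharenum):
        return self.shares[(storage_index, sharenum)]

    def delete_share(self, storage_index, sharenum):
        del self.shares[(storage_index, sharenum)]


def move_mutable_share(src, dst, storage_index, sharenum):
    # Step 1: copy the share to its new home.
    data = src.read_share(storage_index, sharenum)
    dst.write_share(storage_index, sharenum, data)
    # Step 2: read it back before touching the original.
    if dst.read_share(storage_index, sharenum) != data:
        raise RuntimeError("copy verification failed; source share kept")
    # Step 3: only now delete the old copy.
    src.delete_share(storage_index, sharenum)
```

If the copy or the check fails, the source share is left in place, so a
failed move never reduces the number of shares.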

The immutable protocol has no such provision, because of course you
don't want just anybody to be able to delete an immutable share. So it's
less obvious how a client-side move-share operation should be
implemented that preserves the safety of the shares.

It'd be nice to have a specialized form of repair for this, though,
since it can be done with a lot less work than the usual kind (where you
have to generate shares). In this case, you're just moving existing
shares from one place to another. If we improve the storage-server
protocol, you might even be able to talk the two servers into moving the
share directly, instead of pulling it down from one and then pushing it
back up to the other: this would consume the same amount of their
bandwidth but much less of yours. You can imagine a repairer looking at
the sharemap and figuring out the minimum number of moves necessary to
get all shares pushed up to the top of the permuted list.
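
One way such a move planner could work is sketched below, under stated
assumptions: the goal is one share on each of the top N servers (for N
shares), the permuted list has at least N servers, and any share
already on a top server stays put. The function name plan_moves and its
input shapes are hypothetical.

```python
def plan_moves(permuted, sharemap):
    """Return a list of (sharenum, source_server, destination_server).

    permuted: server ids, already in permuted order for this file.
    sharemap: {share_number: server_id currently holding that share}.
    """
    targets = permuted[:len(sharemap)]
    occupied = set()
    to_move = []
    for num in sorted(sharemap):
        holder = sharemap[num]
        if holder in targets and holder not in occupied:
            # Already on a top server that has no other share: leave it.
            occupied.add(holder)
        else:
            to_move.append(num)
    # Each empty top server needs exactly one incoming share.
    free = [server for server in targets if server not in occupied]
    return [(num, sharemap[num], dst) for num, dst in zip(to_move, free)]


if __name__ == "__main__":
    permuted = ["A", "B", "C", "D", "E"]
    print(plan_moves(permuted, {0: "A", 1: "D", 2: "E"}))
```

Under those assumptions the number of moves equals the number of empty
top servers, which is the fewest that can fill them.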

> I imagine this could be implemented using an asynchronous, low-
> priority background operation, even one that runs independently of the
> software as it is, that simply directs the creation or moving of
> shares.

Yeah, there's a ticket (#543) about creating a "rebalancing manager",
which could be an out-of-band tool that talks to a bunch of storage
servers at once, using some new interface. It would ask each one for a
catalog of the shares it holds, use the same permutation as the clients
do to figure out the best homes for those shares, and then direct the
servers to push them around into the right places (and, with authority
over the storage servers, it could tell them to delete the old shares
right away, instead of waiting a month for them to expire). An external
tool like this would be appropriate for an allmydata.com-style grid, in
which you're trying to manage overall system load: the tool could move
things slowly enough to avoid interference with user operations. We
never got around to building it, though.
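
As a rough illustration of the first step such a tool would take, the
sketch below merges per-server catalogs into one sharemap per file. The
catalog format (a list of (storage_index, sharenum) pairs per server)
and the name build_sharemaps are assumptions, since no such interface
exists yet.

```python
from collections import defaultdict


def build_sharemaps(catalogs):
    """Merge per-server catalogs into per-file sharemaps.

    catalogs: {server_id: iterable of (storage_index, sharenum)}
    returns:  {storage_index: {sharenum: sorted list of server_ids}}
    """
    merged = defaultdict(lambda: defaultdict(list))
    for server_id, entries in catalogs.items():
        for storage_index, sharenum in entries:
            merged[storage_index][sharenum].append(server_id)
    # Convert to plain dicts with a stable holder order.
    return {
        si: {num: sorted(holders) for num, holders in shares.items()}
        for si, shares in merged.items()
    }


if __name__ == "__main__":
    catalogs = {
        "A": [("si1", 0), ("si2", 0)],
        "B": [("si1", 1), ("si1", 0)],
    }
    print(build_sharemaps(catalogs))
```

With the sharemaps in hand, the tool could apply the clients'
permutation to each storage index and decide which shares to push where.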

I'm not sure what the answer is for a non-centrally-managed grid, since
anyone you give this move-share power to can also just delete the
shares, and you don't really want to give them that. Maybe something
where servers vaguely trust each other, and server A is willing to
delete a share after it sends it to server B and gets an ACK back. But
that goes against some of the other security patterns we've established.
So the copy-then-wait-to-expire approach is the best I've been able to
think of so far.

thanks,
 -Brian