[tahoe-dev] Selecting optimal FEC parameters (also, idea for new peer selection policy)
shawn at willden.org
Mon Aug 24 21:00:28 UTC 2009
On Wednesday 19 August 2009 03:24:32 pm Brian Warner wrote:
> It would be neat if you could put your share-placement decisions into a
> function and get back a couple of predictive numbers: the file-survival
> probability, the predicted file-download speed (from various places),
> etc, then compare the outcome of a couple of different choices.
And even better if you can implement an efficient solver to find the
share-placement option out of a set of possibilities that optimizes for some
set of criteria. I'd like to find the share placement option that minimizes
FEC expansion consistent with meeting a reliability criterion.
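As a rough sketch of what such a solver might look like, here is a brute-force search under heavily simplified assumptions: one share per server, independent and identical per-server failure probability, and expansion factor N/k as the sole cost. The function names are hypothetical, not Tahoe APIs.

```python
from math import comb

def survival_prob(k, n, p_fail):
    """P(at least k of n shares survive), with one share per server
    and each server failing independently with probability p_fail."""
    p_ok = 1.0 - p_fail
    return sum(comb(n, i) * p_ok**i * p_fail**(n - i)
               for i in range(k, n + 1))

def cheapest_encoding(grid_size, p_fail, target):
    """Search all (k, n) pairs with n <= grid_size for the one that
    minimizes the expansion factor n/k while keeping the file's
    survival probability at or above the reliability target."""
    best = None
    for n in range(1, grid_size + 1):
        for k in range(1, n + 1):
            if survival_prob(k, n, p_fail) >= target:
                expansion = n / k
                if best is None or expansion < best[0]:
                    best = (expansion, k, n)
    return None if best is None else (best[1], best[2])
```

For example, on a 10-server grid with 10% per-server failure probability and a 99.9% survival target, this search settles on 5-of-9 (expansion 1.8) rather than the default-style 3-of-10 (expansion 3.3). A real solver would need smarter pruning than this O(G^2) scan once placements get heterogeneous.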
> > 2. Retrieval performance is maximized when shares are retrieved from
> > as many servers as possible at once (assuming all are roughly equally responsive).
> Only if your downstream pipe is faster than the sum of the servers'
> upstream (but only if those servers aren't doing anything else with
> their upstream bandwidth). American DSL/cablemodem lines seem to mostly
have an 8-to-1 ratio between down and up. So if you assume a homogeneous
> mix of nodes, you don't get too much benefit from k>8.
That's a good point, although I'd set the limit a little higher. My cable
modem has a 15-to-1 ratio (6 Mbps down, 384 kbps up). Zooko's point that the
lockstep download means you're limited to the speed of the slowest of the
servers times K would argue for pushing the limit higher still. I doubt
there will be many cases where k > 20 would be helpful.
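The arithmetic behind that limit can be sketched in a few lines; `useful_k_limit` is a hypothetical helper, and the model deliberately ignores per-server protocol overhead:

```python
def useful_k_limit(down_bps, server_up_bps):
    """Rough ceiling on useful k. A lockstep download proceeds at
    k times the slowest chosen server's upstream rate, so once
    k * slowest exceeds the client's downstream pipe, adding more
    servers buys nothing."""
    slowest = min(server_up_bps)
    return min(len(server_up_bps), down_bps // slowest)

# The 15-to-1 cable modem above, downloading from 384 kbps peers:
# useful_k_limit(6_000_000, [384_000] * 20) -> 15
```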
> Also, this discounts the overhead of talking to lots of servers
> (especially in light of the serialized-peer-selection thread happening
> elsewhere on tahoe-dev), and the system-wide cost of involving lots of
> servers in your download operation.
Also a good point.
> Incidentally, could you use "N" to refer to the number of shares
> produced, rather than "M"? That would match the existing tahoe docs.
Hehe. I recall starting to use "M of N" terminology and being corrected and
told to use "K of M" terminology. I'll use whatever -- but "M of N" feels
most natural to me.
> We still need another letter to indicate the size of the grid.
Perhaps G would be a good choice.
> > Setting M to K**2 (with K = number of servers) ensures that any one of
> > the servers has all the information needed to restore any file, at the
> > cost of an FEC expansion factor of K.
> That's equivalent to using 1-of-(numservers) encoding (i.e. simple
> replication), and changing the downloader to pull multiple segments in
> parallel (i.e. the bittorrent approach).
Yes, except that it doesn't require changing the downloader.
> I'm not sure what I feel about that. I suspect that an expansion factor
> like that would be too much for a lot of people, but of course it
> depends heavily upon both the incentives involved (are you buying
> hardware for them, or asking to use some of their own?) and the
> who-downloads-what goals (having a full copy of my pictures at my mom's
> house means she can view them quickly).
I wasn't seriously suggesting that someone would want to do that. In fact,
for nearly all applications I think the expansion factor is prohibitively
expensive, except when the grid is very small to begin with (e.g. G=2).
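For concreteness, the arithmetic of the K**2 scheme (a hypothetical illustration, not a Tahoe API):

```python
def full_copy_params(num_servers):
    """The K**2 scheme: with k = num_servers and n = k**2 shares
    spread evenly, every server holds k shares -- a complete copy
    of the file -- at an expansion factor of n/k = num_servers."""
    k = num_servers
    n = k * k
    return k, n, n // num_servers, n / k  # (k, n, shares/server, expansion)
```

Even at G=4 the grid stores the file 4 times over, which is why the scheme only looks reasonable for very small grids.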
> > If so, I'd like the upload to FAIL.
> I've been reluctant to agree with this sort of behavior, but I think I
> need to get over that.
I think so, too :-)
For backup applications, my view is that the most important characteristic I'm
looking for from Tahoe is reliability in the face of disk/server failures.
So if I do an upload that does not satisfy my reliability requirements, the
effort was entirely wasted. Even worse, if I don't know that it falls short,
the upload was not just wasted but actually had negative value: it convinced
me that I had safeguarded my data when in fact I had not.
> Maybe if the upload() API included some sort of desired-reliability
> parameter, and if the Uploader concluded that it couldn't be achieved,
> the upload throws a ReliabilityGoalUnreachable exception (which, given
> enough UI work, might give the user the ability to try again with
> less-lofty goals). Maybe my reluctance has been based upon the mistaken
> notion that this "upload failed" response would be indistinguishable
> from a less-workaroundable failure, like having zero servers available,
> or a coding problem.
In one respect, I see your point, but from another perspective, I don't really
care whether the issue was no servers, a syntax error, or insufficient
reliability. In any case, the upload fails to achieve my goals, and that makes
it a problem to be fixed.
I think in many cases, reliability issues can be addressed by increasing the
number of shares distributed until the statistical likelihood of failure falls
below the threshold, so allowing Tahoe to choose K and N based on G and some
per-server failure probabilities should generally ensure that the reliability
requirement can be met (unless the requirement is higher than 1 - P(all
servers fail), which is the best achievable even with a full copy on every
server).
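A sketch of that "multiply shares until reliable enough" approach, allowing heterogeneous per-server failure probabilities via a Poisson-binomial tail. The simplifying assumptions: independent server failures and the same share count on every server; both function names are hypothetical, not Tahoe APIs.

```python
def p_at_least(m, survive_probs):
    """Poisson-binomial tail: P(at least m servers survive), given
    each server's independent survival probability."""
    dp = [1.0]  # dp[j] = P(exactly j survivors among servers seen so far)
    for p in survive_probs:
        dp = [(dp[j] * (1 - p) if j < len(dp) else 0.0) +
              (dp[j - 1] * p if j > 0 else 0.0)
              for j in range(len(dp) + 1)]
    return sum(dp[m:])

def shares_per_server_needed(k, survive_probs, target):
    """Smallest number of shares s to place on every server so that
    a file needing k shares survives with probability >= target, or
    None if even a full copy per server (s = k) falls short."""
    for s in range(1, k + 1):
        servers_needed = -(-k // s)  # ceil(k / s)
        if p_at_least(servers_needed, survive_probs) >= target:
            return s
    return None
```

Note the ceiling on what any share count can buy: at s = k every server holds a full copy, and the survival probability tops out at 1 - P(all servers fail), matching the bound above.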
> I'll respond to the hamming-distance idea elsewhere.
I started to respond to your response, but after thinking it through, I think
your earlier idea of putting the last server prefix in the URL is better:
simpler, and without any disadvantages that I can see.