[tahoe-dev] default values of K, H, N

Greg Troxel gdt at ir.bbn.com
Thu Jan 13 00:38:34 UTC 2011


  I think we should either change the default behavior to match the
  common user expectations, or else add documentation to, if possible,
  explain the surprising thing for them when they begin trying to use
  it.

I think the key issue is that there are several use cases and thus
several sets of expectations.  Trying to have one set of defaults for
those multiple cases will IMHO simply not work.

  There are some reasons (mostly to do with performance and
  availability) why someone might want N > H, but the newbies seem to
  expect H == N. Perhaps we should set H == N in the defaults and then
  let more sophisticated users tune the (K, H, N) for their particular
  grid and their preferences?

There is some merit in that.  But, in thinking about (K, H, N), the other
issue is how H relates to S, which is the number of servers which will
accept shares.

  1. (Safety) Users who entrust valuable data to it without changing the
  defaults won't lose integrity, confidentiality, or data-preservation.

  2. (Unsurprisingness) Users will rarely be surprised by the default behavior.

  3. (Performance and Features) Users will get good transfer speeds, the
  ability to migrate or rebalance files without having to re-encode
  them, better storage efficiency, higher fault-tolerance, etc.

  I would really like to prioritize them in this order.

I think this is broadly sensible.  But there's another goal, which is
that setting up a small grid to experiment should have a reasonable path
to real use.  This can be accomplished by something like 'tahoe cp -R',
or really rsync between two sshfs/sftp mounts of multiple clients.  I
think it's important to be clear if our recommended strategy is to set
up a second grid and copy data, or to expand servers and do repair.
(Personally, I haven't yet trusted data to tahoe for availability
purposes.)

I know of four use cases that are at least reasonably normal:

  1) pubgrid test case: user sets up a client to access the pubgrid.  If
  they're considerate and want to store data, they'll also set up a
  stable server with a global address.

  2) test grid: user sets up minimal private grid with the intent of
  playing with tahoe to figure out what to do more seriously.  Probably
  one machine with introducer and storage server, and some number of
  clients.  The user has no reasonable expectation of redundancy.

  3) non-redundant grid: user sets up minimal private grid, with
  intention to store files, but no expectation of redundancy/reliability
  (beyond one disk).  At first glance this seems silly, but it gets one
  better confidentiality properties than NFS.

  4) proper grid: user sets up a grid with enough servers that the K/N
  coding gets RAID-like or better reliability.  This is the normal
  "production use" case.

I would argue that case 3 is odd, because given how tahoe works, almost
everyone would set up multiple servers (or multiple server processes on
multiple disks) to move to case 4, or would at least want to move to
case 4.

There are obviously subtle variations on 4, where people worry about
administrative and physical redundancy.  One person might consider a box
with 10 disks and 10 server processes to count as 10 independent servers
(if they treated drive failure as the issue, and believed that they
could restore the servers, and didn't worry about the computer
exploding).

Case 1 is like 4, except that it's very messy.  Given my dissing of case
3, I think we're basically talking about case 2 and case 4.  In case 2,
efficiency doesn't matter.  In case 4, the 3/10 encoding seems
reasonable.

So given 3/10 as the default, because it's ok for 2, works for 4, and
has a clean migration path, we're only talking about H.  Given a grid
that usually has S servers, and has at least S' servers with high
probability, it makes sense to set H to S'.  If one isn't comfortable
with the resulting K/H/N reliability, the primary fix is to increase S',
not to worry about the encoding - I'd argue that almost all such trouble
is because S' is <= 6.
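
To make the S versus S' reasoning concrete, here is a rough sketch of how
one might pick S'.  It assumes (and this is only an assumption for
illustration) that each of the S servers is up independently with the
same probability p.  The function names and the target value are
hypothetical and are not part of tahoe.

```python
from math import comb


def prob_at_least(s_prime: int, s: int, p: float) -> float:
    """Probability that at least s_prime of s servers are up.

    Assumes each server is up independently with probability p
    (a simplifying assumption, not something tahoe measures).
    """
    return sum(
        comb(s, k) * p**k * (1 - p) ** (s - k)
        for k in range(s_prime, s + 1)
    )


def largest_safe_h(s: int, p: float, target: float) -> int:
    """Largest S' such that P(at least S' servers up) >= target.

    Returns 0 if even a single server cannot be counted on at the
    requested confidence level.
    """
    for s_prime in range(s, 0, -1):
        if prob_at_least(s_prime, s, p) >= target:
            return s_prime
    return 0


if __name__ == "__main__":
    # Hypothetical grid: 10 servers, each up 90% of the time.
    print(largest_safe_h(10, 0.9, 0.99))
```

Under this model, the value returned by largest_safe_h is a candidate for
H; if it comes out too low for comfort, that is the signal to add servers
rather than to change K or N.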

So I've pretty much convinced myself that 3/7/10 is a reasonable
default.  I don't think you can go much below that without compromising
#1 (safety).
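
For comparing candidate defaults, the arithmetic can be sketched as
below.  This reflects my understanding that an upload which met the
happiness threshold H can lose up to H - K servers and still be
recovered, and that the storage expansion is N / K.  The helper name is
hypothetical.

```python
def describe(k: int, h: int, n: int) -> dict:
    """Summarize a (K, H, N) choice.

    expansion: bytes stored per byte of plaintext (N / K).
    tolerated_server_losses: servers that can vanish after an upload
        that reached happiness H while the file stays recoverable;
        taken as H - K, floored at 0 when H < K.
    """
    if not (1 <= k <= n and 1 <= h <= n):
        raise ValueError("need 1 <= K <= N and 1 <= H <= N")
    return {
        "expansion": n / k,
        "tolerated_server_losses": max(0, h - k),
    }


if __name__ == "__main__":
    for params in [(3, 7, 10), (3, 1, 10)]:
        print(params, describe(*params))
```

With 3/7/10 this gives four tolerated server losses; with 3/1/10 it
gives none, which is why the latter only suits a single-server test
grid.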

So my vote is

tahoe create-node     # makes 3/7/10
tahoe create-demo-node # makes 3/1/10

and create-demo-node puts in a comment

# Note that these parameters result in a lack of redundancy to
# accommodate test grids with a single server.  For real use, set
# shares.independent to 7.


This lets people increase S, declare S', and ratchet up H.