[tahoe-dev] Largest Scale of Tahoe grids

Shawn Willden shawn at willden.org
Fri Nov 4 13:04:54 UTC 2011


On Fri, Nov 4, 2011 at 1:42 AM, Jimmy Tang <jcftang at gmail.com> wrote:
>
> Also assuming that I do build a 100tb tahoe-lafs system across say 6 machines

You'd be better off using more machines.  A larger number of storage
nodes means you can configure your system to use less redundancy
in the encoding, and hence "waste" less storage space, for the same
level of reliability.

Suppose that you build nodes which individually have, say, a 99%
chance of surviving for a year.  If you have 6 nodes, you can choose N
between 1 and 6, and k between 1 and N.  Each pair of choices gives
you an expansion factor of N/k, meaning you can store 100/(N/k)
terabytes, and it also determines the probability p that a given file
is lost.  See my lossmodel paper (in the Tahoe docs) for how to
calculate p.

Here's a table of the options for a six-node grid.  k and N are the
Tahoe encoding parameters, C is the capacity of the resulting grid in
terabytes, and p is the probability that a given file is lost,
assuming you have a direct URI to it (directory trees complicate
things and lower the probability of survival).

k    N    C      p
=    =    ===    =====
1    1    100    1E-2
1    2     50    1E-4
2    2    100    2E-2
1    3     33    1E-6
2    3     67    3E-4
3    3    100    3E-2
1    4     25    1E-8
2    4     50    4E-6
3    4     75    6E-4
4    4    100    4E-2
1    5     20    1E-10
2    5     40    5E-8
3    5     60    1E-5
4    5     80    1E-3
5    5    100    5E-2
1    6     17    1E-12
2    6     33    6E-10
3    6     50    1E-7
4    6     67    2E-5
5    6     83    1E-3
6    6    100    6E-2
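A quick way to check these entries: if nodes fail independently, a
file is lost exactly when fewer than k of its N shares survive the
year, so p is a binomial tail.  Here is a minimal Python sketch under
that assumption (loss_probability is my name for illustration, not a
Tahoe API; the lossmodel paper covers refinements this ignores):

```python
from math import comb

def loss_probability(k, n, s=0.99):
    """Chance that fewer than k of n shares survive the year,
    when each node independently survives with probability s."""
    q = 1 - s
    return sum(comb(n, i) * s**i * q**(n - i) for i in range(k))

# Reprint the table above for a 100 TB raw, six-node grid.
for n in range(1, 7):
    for k in range(1, n + 1):
        capacity = 100 * k / n
        print(f"k={k}  N={n}  C={capacity:.0f}  p={loss_probability(k, n):.0e}")
```

Rounded to one significant figure, this reproduces the table.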

If you look at the table a little, you'll see that the best
combinations of capacity and reliability are towards the bottom, with
N=5 or N=6.  If you choose a target reliability level, say p < 1E-6,
then your best option is k=3, N=6 which only gives you a capacity of
50T.

However, if you increase that to 10 nodes, then you can choose k=6,
N=10 and have a capacity of 60T with p=2E-8.  If you go to 20 nodes
your capacity at that reliability level increases to 75T (k=15,
N=20).  And so on.
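The same independence model makes this concrete: fix the reliability
target, and for each grid size keep the largest k that still meets
it.  A sketch (function names are mine, not Tahoe's):

```python
from math import comb

def loss_probability(k, n, s=0.99):
    """Chance that fewer than k of n shares survive the year,
    when each node independently survives with probability s."""
    q = 1 - s
    return sum(comb(n, i) * s**i * q**(n - i) for i in range(k))

def capacity_at(n, raw_tb=100, target=1e-6, s=0.99):
    """Usable TB for an n-node grid holding raw_tb raw terabytes,
    using the largest k whose loss probability stays under target."""
    good = [k for k in range(1, n + 1) if loss_probability(k, n, s) < target]
    return max(good, default=0) * raw_tb / n

for n in (6, 10, 20, 40):
    print(f"{n:2d} nodes -> {capacity_at(n):.0f} TB usable at p < 1e-6")
```

With s=0.99 this prints 50, 60, 75, and 85 TB for 6, 10, 20, and 40
nodes: capacity at a fixed reliability level grows with N.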

When you analyze capacity at a given reliability, or reliability at a
given capacity, you will find that the math always favors larger N,
and therefore larger numbers of nodes.

Of course, you also have to trade that off against cost.  It would be
easy to factor that into the model as well.

As to your original question: I haven't noticed any file size
limitations with Tahoe; I've stored files as large as 2 GB.  And the
architecture imposes no limits on total grid size, either.  You will
want to look at what your bandwidth limitations might do, keeping in
mind that Tahoe limits your upload (and maybe download?) speed to that
of the slowest node in the grid.

--
Shawn
