[tahoe-dev] tahoe-lafs suitable for 500TB ~ 3PB cluster?

Avi Freedman freedman at freedman.net
Sun Apr 21 21:53:45 UTC 2013

Hi, Dieter.  I am newer to this than you are, though I've been poking
around for a month or two thinking about using Tahoe for an online
storage service.  Zooko has graciously answered a bunch of my
questions and pointed me to trac tickets/discussions about some of
the scaling issues.

I hope I can contribute and not fuzz things further, but since we're
looking at some overlapping use cases I'll try to answer or give
some suppositions and thoughts.

> > I'm looking to store video files (avg about 150MB, max size 5GB).
> > the size of the cluster will be between 500TB to 3PB (usable) space, it depends on how feasible it would be to implement.
> > at 500TB i would need 160Mbps output performance, at 1PB about 600Mbps, at 2PB about 1500Mbps and at 3PB about 6Gbps.
> > output performance scales exponentially wrt cluster size.
> I don't understand this. Why do output (what we would call "download"
> or "read") bandwidth requirements go up when the cluster size goes up?
> Oh, I guess because you need to service more users with a larger
> cluster.
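
A quick sanity check on the quoted numbers, as a rough sketch: dividing
each bandwidth target by the usable capacity gives the demand per TB.
This takes the four sizes as 500TB, 1PB, 2PB and 3PB (an assumption about
the intended units) and uses decimal units (1PB = 1000TB).

```python
# Quoted targets: usable capacity in TB -> required read bandwidth in Mbps.
# The 1000 and 2000 TB rows assume the middle two sizes are 1PB and 2PB.
targets = [(500, 160), (1000, 600), (2000, 1500), (3000, 6000)]


def mbps_per_tb(points):
    """Return the required Mbps per usable TB for each (capacity, bandwidth) pair."""
    return [mbps / tb for tb, mbps in points]


ratios = mbps_per_tb(targets)
for (tb, mbps), ratio in zip(targets, ratios):
    print(f"{tb:>5} TB  {mbps:>5} Mbps  {ratio:.2f} Mbps/TB")
```

The per-TB demand itself rises as the cluster grows, which is why the
total requirement grows faster than linearly with capacity.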

The client-side bandwidth to, say, delivery or encoding boxes would be
higher than with Swift because you'd be pulling the parity over the wire.
But that shouldn't really matter.

The biggest issue might be added latency in a delivery system.  
For encoding it should all be streaming.  If you were going to 
serve byte range fetches without caching it'd probably need a lot
of testing.

> By the way, I'm planning to do a very similar experiment with someone
> else within a week or so. If that experiment comes together, we'll
> post our results to this list.

I'll chime in, since that experiment is for potential Havenco and
ServerCentral use; Zooko has offered to have folks work with us on it...

We're going to take nine 45-disk systems (dual E5520, 24GB RAM each)
and test Tahoe on them (one proc per physical disk).  All JBOD controllers
(LSI 1068 or such) and Supermicro JBOD/servers, though in production
you might want higher-quality hardware, as the LSI backplanes in the
Supermicros introduce latency when disks start to fail, and that could
increase the chance of any given fetch experiencing latency unless
or even if Tahoe-LAFS has aggressive timeouts before it grabs extra

And Dieter, you're welcome to join in when we do it and have a
machine in the local infrastructure if you want to do some testing.
We're setting it up in a Usenet infrastructure, so we may write an
NNTP<->API interface or just use FUSE to access it for testing.

Everything will be connected by 10gig, either initially or within a
week or so.

We're even going to test running jobs against the filesystem on the
storage boxes themselves, and potentially running other workloads
(Usenet) on the same disks.

So, for example, if Tahoe-LAFS didn't suck full CPU, you might want
to run delivery and/or encode procs on the same systems as the disks.

> There's a performance problem in Tahoe-LAFS, but it isn't *overhead*,
> it is unnecessary delay. Network operations often stall  unnecessarily
> and take longer than they needed to. But they aren't busily saturating
> the network while they do that, they're leaving it idle while they do
> that, so it doesn't interfere as much as it might with other network
> usage.

One of the things I'd want to look at in production: are there any methods
to tune timeouts and aggressively start fetching more chunks if one fails?

For a local gateway use case (less secure than all client-driven, I think)
this could be reasonable, as:

- one can assess what latency is reasonable much better (no Internet delays)
- no increased Internet $ to do the fetches
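
To make the "aggressive timeout, then grab more chunks" idea concrete,
here is a minimal sketch using plain asyncio.  It is not Tahoe-LAFS code
and does not use any Tahoe-LAFS API: fetch_k_of_n, hedge_after and the
fake server() coroutines are all hypothetical names for illustration.
It asks k servers first, and whenever nothing completes within
hedge_after seconds (or a fetch fails) it asks one more spare server.

```python
import asyncio


async def fetch_k_of_n(fetchers, k, hedge_after):
    """Collect k results, starting a spare fetch when progress stalls.

    fetchers: zero-argument coroutine functions, one per server
              (hypothetical stand-ins for real share requests).
    k: number of shares needed to reconstruct the data.
    hedge_after: seconds to wait without a completion before hedging.
    """
    if k > len(fetchers):
        raise ValueError("need at least k servers")
    spares = list(fetchers)
    pending = {asyncio.create_task(spares.pop(0)()) for _ in range(k)}
    results = []
    try:
        while len(results) < k:
            if not pending:
                raise RuntimeError("ran out of servers before getting k shares")
            done, pending = await asyncio.wait(
                pending, timeout=hedge_after,
                return_when=asyncio.FIRST_COMPLETED)
            failed = 0
            for task in done:
                if task.exception() is None:
                    results.append(task.result())
                else:
                    failed += 1
            # Timeout with nothing done: hedge with one spare.
            # Failures: replace each failed fetch with a spare.
            want = failed if done else 1
            for _ in range(min(want, len(spares))):
                pending.add(asyncio.create_task(spares.pop(0)()))
    finally:
        for task in pending:
            task.cancel()  # stop waiting on the slow servers
    return results[:k]


async def demo():
    async def server(name, delay, fail=False):
        await asyncio.sleep(delay)
        if fail:
            raise IOError(name)
        return name

    fetchers = [
        lambda: server("a", 0.01),
        lambda: server("b", 5.0),              # a disk that has gone slow
        lambda: server("c", 0.01, fail=True),  # a disk that has failed
        lambda: server("d", 0.01),
        lambda: server("e", 0.01),
    ]
    return await fetch_k_of_n(fetchers, k=3, hedge_after=0.05)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

On a local network hedge_after can be set tight, since a slow answer
almost certainly means a sick disk rather than Internet delay; the cost
is some extra fetches that end up being thrown away.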

> The problematic cases have to do with the long-term. What happens when
> you've spread a file out across 10 servers, and after a few years, 7
> of that original tranche have been decommissioned? There's a process for
> re-spreading the file across newer servers, but that process isn't
> triggered automatically, and there are some cases where it doesn't do
> the right thing.

I can say that our planned deployment, if we were to use it for Havenco
and/or ServerCentral, would be to use enough extra parity that the expected
drive failure rate in a reasonable worst case would give us at least
6-9 months before we had to worry about building a centralized distributed
automated rebuild system, or incenting others to with $.  There are also,
if I understand correctly, some security issues with having a well-connected
proxy do rebuilds on behalf of the end user, but for pure efficiency in
local storage for an enterprise all that goes out the window.
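
As a rough way to reason about "enough extra parity", here is a sketch
of the chance that a k-of-n encoded file becomes unrecoverable over some
window with no repair.  It assumes each share is lost independently with
the same probability, which real drive failures only approximate; the
5% figure and the encodings compared are made-up inputs, not measurements.

```python
from math import comb


def p_file_lost(k, n, p_share_lost):
    """Probability that fewer than k of n shares survive.

    Assumes each share is lost independently with probability
    p_share_lost over the window, and that nothing is repaired
    during that window.
    """
    q = 1.0 - p_share_lost
    # The file is unrecoverable when the number of surviving shares s is < k.
    return sum(comb(n, s) * q**s * p_share_lost**(n - s) for s in range(k))


# Hypothetical input: 5% of shares lost over a 9-month window.
for k, n in [(3, 6), (3, 10), (3, 14)]:
    print(f"{k}-of-{n}: P(loss) = {p_file_lost(k, n, 0.05):.3e}")
```

Plugging in a pessimistic per-share loss rate for the 6-9 month window
and raising n until the result is comfortably small gives a first
estimate of how much parity to buy.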

Another set of issues to look at is around expiring old content -
not sure if that would be a concern for you.  I think (again, I'm still
conceptual and haven't fully digested all the docs) right now the
clients are expected to refresh the TTL on the content.  In your
scenario that may not be an issue, and there may also already be
a module that can do that in bulk for *.  Or maybe one could set
everything to not expire, then manually set leases to expire for
specific chunks?  Again, I apologize for my fuzziness here, but someone
knowledgeable will probably chime in, I expect.
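
To pin down what I mean by refreshing the TTL, here is a toy model of
lease-based expiry.  It is only a sketch of the concept, not how
Tahoe-LAFS stores or renews leases: Share, renew_all, collect_garbage
and the 31-day duration are all assumptions made for illustration.

```python
from dataclasses import dataclass

LEASE_SECONDS = 31 * 24 * 3600  # assumed lease duration for this toy model


@dataclass
class Share:
    name: str
    lease_expires: float  # absolute time, in seconds


def renew_all(shares, now, duration=LEASE_SECONDS):
    """Bulk renewal: push every given lease out to at least now + duration."""
    for share in shares:
        share.lease_expires = max(share.lease_expires, now + duration)


def collect_garbage(shares, now):
    """Return only the shares whose lease has not yet expired."""
    return [share for share in shares if share.lease_expires > now]


# Two shares start with the same expiry; only one is renewed.
keep = Share("keep-me", lease_expires=100.0)
drop = Share("forgotten", lease_expires=100.0)
renew_all([keep], now=50.0, duration=100.0)
survivors = collect_garbage([keep, drop], now=120.0)
print([share.name for share in survivors])
```

The point of the model is that content nobody renews simply ages out,
so a storage service needs either a bulk renewer run on a schedule or
expiry turned off by default.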

There are a bunch of tickets around this that I looked at but didn't digest.

> Regards,
> Zooko

