[tahoe-dev] tahoe-lafs suitable for 500TB ~ 3PB cluster?

Plaetinck, Dieter dieter at vimeo.com
Tue Apr 23 14:45:33 UTC 2013

On Sun, 21 Apr 2013 15:14:33 -0600
Zooko O'Whielacronx <zookog at gmail.com> wrote:

> Hi, Dieter! Welcome.
> On Tue, Apr 16, 2013 at 11:55 AM, Plaetinck, Dieter <dieter at vimeo.com> wrote:
> > I'm looking to store video files (avg about 150MB, max size 5GB).
> > the size of the cluster will be between 500TB and 3PB (usable) space, it depends on how feasible it would be to implement.
> > at 500TB i would need 160Mbps output performance, at 1PB about 600Mbps, at 2PB about 1500Mbps and at 3PB about 6Gbps.
> > output performance scales exponentially wrt cluster size.
> I don't understand this. Why do output (what we would call "download"
> or "read") bandwidth requirements go up when the cluster size goes up?
> Oh, I guess because you need to service more users with a larger
> cluster.

We currently have another (expensive) cluster running which serves a broad range of traffic:
everything from cold files to very hot/popular ones. We pay for this service by the stored GB/month (irrespective of the amount of traffic),
so I'm looking into building a new cluster to partially or completely replace it, to save money.
If I replace it partially, it would make most sense to move mostly the cold files and leave the hot ones where they are. (For our own cluster, we do need to pay for traffic, and cold files also mean lower infrastructure requirements.)
This is why the cluster I'd like to build is anywhere between 500TB and 3PB (depending on costs): the larger it gets,
the more hot files it will need to store, so the output bandwidth increases exponentially as a function of cluster size. Input bandwidth is negligible: it basically amounts to storing every file once (and then keeping it forever).
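
To make the scaling concrete, here is a minimal sketch that computes the required output bandwidth per TB stored from the figures quoted above. It takes the two middle data points to be 1PB and 2PB (an assumption), and the function name is hypothetical.

```python
# (usable capacity in TB, required output in Mbps), from the figures quoted above.
# Assumption: the two middle data points are 1PB and 2PB.
requirements = [(500, 160), (1000, 600), (2000, 1500), (3000, 6000)]


def mbps_per_tb(capacity_tb, output_mbps):
    """Output bandwidth required per TB of stored data."""
    return output_mbps / capacity_tb


ratios = [mbps_per_tb(c, m) for c, m in requirements]

for (capacity_tb, output_mbps), ratio in zip(requirements, ratios):
    print(f"{capacity_tb:>5} TB -> {output_mbps:>5} Mbps ({ratio:.2f} Mbps per TB)")
```

The per-TB figure rises with every step, so the bandwidth requirement grows faster than linearly with capacity, which is what you would expect if each additional TB holds hotter files than the last.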

> There's a performance problem in Tahoe-LAFS, but it isn't *overhead*,
> it is unnecessary delay. Network operations often stall unnecessarily
> and take longer than they needed to. But they aren't busily saturating
> the network while they do that, they're leaving it idle while they do
> that, so it doesn't interfere as much as it might with other network
> usage.

What's going on there? Is there a ticket for this? I'd like to know about any such issues.

> >  does it have a lot of consistency checking overhead? what does it have to do that's not just (for ingest) splitting incoming files, sending chunks to servers to store them to disk, and on output the reverse? i assume the error codes are not too expensive to compute and to check because the cpu has opcodes for them?
> It uses SHA256 and Merkle Trees, and it tests integrity both before
> and after the erasure-decoding step. This is really heavy-duty from
> the perspective of filesystems folks, but we came from the perspective
> of crypto folks, where SHA256 is a strong standard.
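
As an illustration of the kind of integrity check described above, here is a minimal sketch of a binary Merkle tree built with SHA-256. It is a generic construction and not Tahoe-LAFS's actual hash-tree layout or tagging scheme; the `leaf:`/`node:` prefixes and the function names are assumptions made for the example.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Root hash of a binary Merkle tree over the given blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    # Distinct prefixes keep leaf hashes and interior-node hashes apart.
    level = [sha256(b"leaf:" + block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            sha256(b"node:" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


blocks = [b"block-0", b"block-1", b"block-2"]
root = merkle_root(blocks)

# Changing any single block changes the root, so a reader who knows the root
# can detect corruption.
tampered = [b"block-0", b"block-X", b"block-2"]
print(merkle_root(tampered) != root)
```

The cost is one SHA-256 pass over the data plus a small number of extra hashes for the interior nodes, so the work grows linearly with file size.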

Is there any "official" architecture page? I only found
* Do I understand correctly that the client nodes are responsible for chunking up files and for encoding/decoding them to/from erasure-coded chunks?
* The introducer nodes just describe the cluster to clients, and the clients basically talk to storage nodes directly?
* How is balancing done, to ensure chunks for the same file are spread out over different failure domains (different racks etc.)?
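
For readers unfamiliar with the erasure-coding step these questions refer to, here is a toy sketch of a 2-of-3 scheme using XOR parity: the client splits the data into two halves, adds one parity share, and can rebuild the data from any two shares. This is only an illustration of the idea, not Tahoe-LAFS's encoder (which uses a more general k-of-n code), and all names here are hypothetical.

```python
def xor_bytes(left: bytes, right: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(left, right))


def encode_2_of_3(data: bytes):
    """Split data into two halves plus an XOR parity share."""
    if len(data) % 2:
        data += b"\x00"  # pad to even length; a real scheme would record the true length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return [a, b, xor_bytes(a, b)]


def decode_2_of_3(shares):
    """Rebuild the (padded) data from three slots, at most one of which is None."""
    a, b, parity = shares
    if a is None:
        a = xor_bytes(b, parity)
    elif b is None:
        b = xor_bytes(a, parity)
    return a + b


original = b"videodata!"
shares = encode_2_of_3(original)

# Simulate the loss of each share in turn (e.g. one storage node being down).
for lost in range(3):
    available = [s if i != lost else None for i, s in enumerate(shares)]
    print(lost, decode_2_of_3(available) == original)
```

With this scheme each share is half the size of the input, so storing three shares costs 1.5x the original size; placing the three shares in different failure domains is what makes the loss of any one of them survivable.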

I would love to see recent numbers describing the input/output performance (in terms of Mbit/s, or file requests in/out) as a function of CPU/memory (e.g. "on a 16-core machine with 16GB RAM you can request 100 files per second of avg filesize 200MB and it will saturate RAM first" or something), or the input/output performance as a graph as a function of core count, memory usage, etc.
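
As a rough sanity check on figures like these, the sketch below converts between request rate and sustained bandwidth. It assumes whole-file downloads and decimal units (1 MB = 8 Mbit, 1 Gbit = 1000 Mbit); the function names are hypothetical.

```python
def required_gbps(requests_per_second, avg_file_mb):
    """Sustained output in Gbit/s needed to serve whole files at the given rate."""
    return requests_per_second * avg_file_mb * 8 / 1000


def full_downloads_per_second(target_gbps, avg_file_mb):
    """Whole-file downloads per second that fit in the given bandwidth."""
    return target_gbps * 1000 / (avg_file_mb * 8)


# The hypothetical "100 files per second of avg filesize 200MB" example above.
example_gbps = required_gbps(100, 200)

# The 3PB target of about 6Gbps with the 150MB average file size from this thread.
target_rate = full_downloads_per_second(6, 150)

print(example_gbps, target_rate)
```

Under these assumptions the 100 files/s example corresponds to 160 Gbit/s, while the 6Gbps target is only about 5 full downloads per second of an average-sized file, so the numbers that matter for this use case are at the low end of that range.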

> Regards,
> Zooko


On Sun, 21 Apr 2013 17:53:45 -0400
Avi Freedman <freedman at freedman.net> wrote:

> The biggest issue might be added latency in a delivery system.  

I'd love to get more insight on this.

> If you were going to serve byte range fetches without caching it'd probably need a lot
> of testing.

Because the performance is unknown, or because Tahoe might be buggy in that use case?

> And Dieter, you're welcome to join in when we do it and have a local
> machine in the local infrastructure if you want to do some testing. 

Cool, thanks. Are you on IRC anywhere? (I'm Dieterbe on freenode)

> Avi

