[tahoe-dev] tahoe-lafs suitable for 500TB ~ 3PB cluster?

Avi Freedman freedman at freedman.net
Tue Apr 23 15:17:05 UTC 2013


> On Sun, 21 Apr 2013 17:53:45 -0400
> Avi Freedman <freedman at freedman.net> wrote:
> 
> > The biggest issue might be added latency in a delivery system.  
> 
> I'd love to get more insight on this.

(Note: this is all theoretical at this point based on what Einstein would have
called "thought experiments")

There are actually 2 related issues I am concerned with.  
The second may not be a big challenge with enough disks.

The first, re latency, is that you are compounding the failures and timeouts of N backend
fetches at all times.  So with flaky disks, backplanes, systems, etc., the chances
go up that there will be a delay in grabbing a file or a chunk thereof.

The second potential issue is re IOPS.

You should expect erasure coding to suck IOPS out of the systems even worse than RAID does -
i.e. for 1 MB byte-range reads on a Swift or similar stack, the disk seeks somewhere and
gets the bytes (1 MB-8 MB should all, to an order of magnitude, be contiguous; worst case a
constant N=2 seeks to grab, say, a 1 MB video chunk).

With RAID and a 1 MB stripe size you'd expect N=2-4 disk IOPS (since things won't be
aligned).

With RAID and a bad (128 KB) stripe size it'd be at least 9 IOPS across, say, 8 disks.
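
To make that concrete, here's the kind of back-of-the-envelope arithmetic I'm doing
(my own assumptions about alignment and seek costs, nothing measured):

    import math

    def iops_per_read(read_bytes, stripe_bytes, aligned=False):
        # Rough seek count for one byte-range read: one seek per stripe
        # touched, plus one extra stripe when the read isn't aligned.
        stripes = math.ceil(read_bytes / stripe_bytes)
        return stripes if aligned else stripes + 1

    MB = 1024 * 1024

    # Swift-style contiguous object on one disk: worst case ~2 seeks
    print(iops_per_read(1 * MB, 8 * MB))        # -> 2
    # RAID with a 1 MB stripe: the read straddles a boundary, ~2 seeks
    print(iops_per_read(1 * MB, 1 * MB))        # -> 2
    # RAID with a 128 KB stripe: ~9 seeks across ~8 disks
    print(iops_per_read(1 * MB, 128 * 1024))    # -> 9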

With erasure coding you'd probably be tempted to be failure-tolerant but efficient, so
if you did something like 8,14 or 10,15 or 16,21 or whatever (since 3,10 is worse space
efficiency than Swift gives you now), you are guaranteed to need at least 8
IOPS to grab your content.

So if you are building a delivery cluster and need to back into gigabits/second based
on the # of disks, your total throughput is N * 3 worse than with a Swift-type approach
(where N is the N in N,M), because with Swift at scale you have 3 disks you can do
about 1 IOP to, and with erasure coding you have N disks you have to touch to get an
object.
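
Or, as a sketch of the gigabits-per-second arithmetic (my own assumed per-disk IOPS
and chunk size, nothing Tahoe-specific):

    def deliverable_gbps(disks, iops_per_disk, chunk_bytes, disks_touched):
        # Aggregate delivery rate when every object read has to touch
        # `disks_touched` spindles.  This only counts seeks; it doesn't
        # capture Swift also letting you pick the least-busy of 3 replicas.
        reads_per_sec = disks * iops_per_disk / disks_touched
        return reads_per_sec * chunk_bytes * 8 / 1e9

    MB = 1024 * 1024
    # 100 disks, ~100 random IOPS each, 1 MB video chunks
    print(deliverable_gbps(100, 100, MB, 1))   # Swift-style: ~84 Gbps
    print(deliverable_gbps(100, 100, MB, 8))   # 8,14 erasure coding: ~10 Gbps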

What Scality does is build a ring (volume) of non-redundant disks to effectively act
as a cache in front of an erasure-coded 'deep and cheap' volume.

I have always thought this is the optimal way to architect a content storage system -
but it does break down (well, not break, just IOPS efficiency goes down) if you have
super long-tail stuff.
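
The read path I have in mind is roughly this (hypothetical names, not Scality's or
Tahoe's actual APIs):

    class TieredStore:
        # Non-redundant cache ring in front of an erasure-coded deep tier.
        def __init__(self, cache_ring, deep_tier):
            self.cache = cache_ring   # ~1 IOP per hit, no redundancy
            self.deep = deep_tier     # ~N IOPS per read, erasure coded

        def get(self, key, offset, length):
            data = self.cache.get(key, offset, length)
            if data is not None:
                return data           # hot/head content: one disk touched
            data = self.deep.get(key, offset, length)   # long tail: N disks
            self.cache.put(key, offset, data)           # warm the cache
            return data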

> > If you were going to serve byte range fetches without caching it'd probably need a lot
> > of testing.
> 
> because the performance is unknown or because tahoe might be buggy in that use case?

Just performance, see above.

The big Q for me (and I haven't looked at the code yet or tried it) is how 
aggressive the timeouts can be.

When the client needs to pull N chunks, if one backend server is slow, what is the
timeout and retry?
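
The client-side policy I'd want looks roughly like this (a sketch of what I mean by
aggressive, not how Tahoe actually schedules share fetches; fetch_share is a
hypothetical blocking call):

    import concurrent.futures as cf

    def fetch_k_shares(servers, k, fetch_share, timeout=0.25):
        # Ask the first k servers for shares; any time we go `timeout`
        # seconds without progress, hedge by asking a spare server too.
        pool = cf.ThreadPoolExecutor(max_workers=len(servers))
        pending = {pool.submit(fetch_share, s) for s in servers[:k]}
        spares = list(servers[k:])
        shares = []
        while len(shares) < k:
            done, pending = cf.wait(pending, timeout=timeout,
                                    return_when=cf.FIRST_COMPLETED)
            shares.extend(f.result() for f in done if f.exception() is None)
            if len(shares) >= k:
                break
            if spares:
                pending.add(pool.submit(fetch_share, spares.pop(0)))
            elif not pending:
                raise IOError("not enough shares reachable")
        return shares[:k]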

And, for you, how big are the on-disk chunks that objects are stored in?

I am assuming 1 MB would be a typical byte-range request for broadband clients doing
adaptive-bitrate video?

If the block size is too small (say 50 KB, which I think it may once have been),
you get a lot of latency from doing M fetches, each from at least N targets.
If it's too big and you have to reconstruct a whole block to pull 1 MB out, that
introduces extra overhead and data-transfer latency as well.
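
Roughly, the tradeoff I'm worried about (my arithmetic, with made-up block sizes,
using 'block' the way I did above for the on-disk unit that has to be reconstructed):

    import math

    def cost_of_byterange(range_bytes, block_bytes, k):
        # Blocks that must be reconstructed, share fetches issued, and
        # bytes pulled off disk to serve one unaligned byte-range read,
        # assuming each block is erasure coded across k servers.
        blocks = math.ceil(range_bytes / block_bytes) + 1
        return blocks * k, blocks * block_bytes

    MB = 1024 * 1024
    # A 1 MB byte-range request with k = 8:
    print(cost_of_byterange(1 * MB, 50 * 1024, 8))  # 50 KB blocks: ~176 fetches
    print(cost_of_byterange(1 * MB, 16 * MB, 8))    # 16 MB blocks: ~32 MB read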

I'm not sure if something like this got implemented?

https://tahoe-lafs.org/trac/tahoe-lafs/ticket/397

Which introduces another question - 

If the retrieval block size is larger than the # of bytes needed, can just a
subset be retrieved from each block?  Not sure how that works with erasure
coding.
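
My guess is it depends on the layout: if the code stripes a segment across the shares
so that byte j of every share covers the same narrow slice of the segment (a
hypothetical layout, not necessarily what zfec/Tahoe does), then a small byte range
should only need a small, aligned piece of any k shares:

    import math

    def share_byterange(seg_offset, length, k):
        # Map a byte range within a segment to the (offset, length) that
        # would be needed from each share under a byte-striped layout;
        # any k shares' worth of that range reconstructs the data.
        start = seg_offset // k
        end = math.ceil((seg_offset + length) / k)
        return start, end - start

    # A 64 KB range with k=8 needs only ~8 KB from each of 8 shares,
    # not the whole block.
    print(share_byterange(500 * 1024, 64 * 1024, k=8))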

> > And Dieter, you're welcome to join in when we do it and have a local
> > machine in the local infrastructure if you want to do some testing. 
> 
> cool, thanks. are you on irc anywhere? (I'm Dieterbe on freenode)

I am on AIM (avibgp) but haven't been an IRC fan since the late 80s/90s.

If I stick with Tahoe-LAFS I guess I'll have to get back into it.

> > Avi
> 
> thanks

Avi



