[tahoe-dev] servers of happiness during repair

sickness sickness at tiscali.it
Thu Jun 9 19:30:42 UTC 2011


Since this topic was previously discussed a lot between me, zooko and others on IRC, and kim0 brought it up again on #tahoe-lafs yesterday, I'll try to sum it up here as best I can, in the simplest way (as zooko also requested, instead of scattering comments across various tickets on the trac):

We all know what RAID is: it's not a substitute for a backup, but it makes it possible for a filesystem to keep working even if one or more disks fail.
So for example with RAIDZ on ZFS, if I have a 5-disk pool and I lose 1 disk, I can take out the failed disk, put in a new one, and have the pool resync the data onto it, all while it keeps working, uninterrupted, and without losing data (so no need to recreate the filesystem and restore from backup).

Enter the concept of RAIC, a redundant array of inexpensive clouds: with Tahoe-LAFS I can have, for example, a filesystem that's distributed over 5 storage nodes. To keep the example simple, each storage node is a single fixed hard disk in a separate physical computer.
So far so good: when I upload a file to this filesystem, just like on my RAIDZ ZFS pool, the file is striped across the 5 storage nodes, and each storage node holds a part of the file, called a "share".
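For reference, these encoding parameters live in the [client] section of tahoe.cfg. The 4-of-5 encoding I'm assuming for this example (any 4 shares are enough to rebuild the file, as I'll use below) would look roughly like this; option names from memory, so double-check your own tahoe.cfg:

    [client]
    shares.needed = 4
    shares.happy = 5
    shares.total = 5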
Now if I lose 1 storage node, I can take out the failed computer, put in a new one, and have Tahoe-LAFS repair the data on the pool. BUT it can happen that the repairer, instead of simply recreating the lost share on the new (and empty) node, puts the share on one of the old nodes that already has a share, so I end up with this situation, where N is node and S is share:
N1[S1]
N2[S2 S5]
N3[S3]
N4[S4]
N5 
In the meantime the filesystem is still OK and can still work, because the file only needs 4 of its 5 shares to survive...
but what if the N2 server then dies? :/
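To make that concrete, here's a quick brute-force check of how many server failures a given share placement can actually tolerate. This is only an illustrative sketch of mine; the placement dict and the min_fatal_failures function are made up, not anything in Tahoe-LAFS:

    from itertools import combinations

    def min_fatal_failures(placement, k):
        """Smallest number of lost servers that leaves fewer than k shares."""
        servers = list(placement)
        for n in range(1, len(servers) + 1):
            for lost in combinations(servers, n):
                remaining = sum(len(placement[s]) for s in servers if s not in lost)
                if remaining < k:
                    return n
        return None

    # the placement from the diagram above, with 4-of-5 encoding
    placement = {"N1": ["S1"], "N2": ["S2", "S5"], "N3": ["S3"], "N4": ["S4"], "N5": []}
    print(min_fatal_failures(placement, k=4))   # -> 1: losing N2 alone leaves only 3 shares

So even though a check still reports the file as healthy (all 5 shares exist), after that repair a single well-placed failure is enough to lose the file, which defeats the point of the 4-of-5 encoding.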

So maybe my example is oversimplified, but the interaction between this behavior of the repairer and the various combinations of K:H:N can be subtle, and can change the redundancy of your grid in subtle ways you weren't expecting back when the filesystem was still working and all the checks were reporting that a file was OK because it had 5 shares out of 5... Doing some simple IRL tests, I've seen that on a 5:10:10 grid a server can sometimes end up holding 3 shares (!!!) of a file, and that's a problem for me, because I intended a 5:10:10 grid to be able to sustain the loss of as many as 5 servers without losing data; but if one of those servers holds 3 shares of a file, losing 4 normal servers plus that 3-share server will surely mean losing data.
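Running the same sketch on that 5:10:10 case, and assuming the other seven shares landed on seven distinct servers, shows how much redundancy has quietly evaporated:

    # hypothetical 5-of-10 placement where one server ("A") ended up with 3 shares
    placement_10 = {"A": ["S1", "S2", "S3"]}
    placement_10.update({"B%d" % i: ["S%d" % (i + 3)] for i in range(1, 8)})  # seven servers with 1 share each
    placement_10.update({"C1": [], "C2": []})                                 # two servers left empty
    print(min_fatal_failures(placement_10, k=5))  # -> 4: losing "A" plus any 3 single-share servers leaves only 4 shares

So a grid that I set up expecting it to survive the loss of any 5 servers can, with a placement like that, already lose data after only 4 failures.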


