[tahoe-dev] newbie questions ++

Wed Sep 24 12:26:07 UTC 2008

On Friday 19 September 2008 22:29:39 Brian Warner wrote :
> On Fri, 19 Sep 2008 16:16:40 +0200
> Alain Baeckeroot <alain.baeckeroot at univ-avignon.fr> wrote:
> 
> > Hello
> 
> Welcome to the list!
> 
> > We just discovered Tahoe and are curious to try it for our internal
> > use, for a cheap, easy to install, and redundant distributed archive
> > system.
> 
> Excellent!

Thanks for all your precise answers.

We are going to test it with CLI and _with_ AES as it seems to be required.

Some more questions (we are paranoid, and afraid of losing everything):
- Is there a known "single point of failure" (initiator server... ?).
   If yes, is it possible to work around it ?

- What kind of tests should we do to push the system to its limits, and
  check how robust it is ?
  We are thinking of "normal" tests:
  * concurrent read/write of different files
  * using servers with different amounts of free space
  * filling filesystems to 100%
  * maybe trying with a large number of machines (100~1000, mainly windows)

- More generally, are there known or suspected issues we should test for ?

Regards.
Alain.

Btw, I found this link about using GPUs for AES encryption (0.1 ~ 8 Gbits/s)
http://www.manavski.com/downloads/PID505889.pdf so if it is performance
critical there is room for improvement :-)

> 
> > 1/ Is it possible to specify that we have N servers and want
> > tolerance to K failures ? Or to know the state of redundancy ?
> 
> FYI: in our nomenclature (which is reasonably close to the standard usage)
> "N" is the number of shares created, and "k" is the number of shares you
> need. So you can tolerate losing N-k shares and still be able to recover the
> file. If there are exactly N servers, you'll have one share per server, and
> you can lose N-k servers and still be able to recover the file. (when you
> have more than N servers, you can probably tolerate slightly more than N-k
> lost servers).
> 
> Tahoe currently uses 3-out-of-10 encoding (k=3, N=10), so you can tolerate 7
> lost shares (although you'd want to repair well before you got to that
> point). There's nothing magic about these (very conservative) numbers. To
> change them, just modify line 58 of src/allmydata/client.py, to change the
> definition of DEFAULT_ENCODING_PARAMETERS. (we don't yet have a simple
> configuration file to set this one, sorry).
> 
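
To make the k/N arithmetic above concrete, here is a minimal sketch. The
function name and the returned keys are made up for illustration and are not
part of Tahoe; it only restates "you can lose N-k shares" and adds the
storage expansion factor N/k that follows from it.

```python
def encoding_summary(k, n):
    """Summarize a k-of-N encoding: k shares needed, N shares created."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= N")
    return {
        # Shares that can disappear while the file stays recoverable.
        "tolerated_lost_shares": n - k,
        # Backend bytes stored per byte of user data.
        "expansion": n / k,
    }


# The 3-out-of-10 default mentioned above.
print(encoding_summary(3, 10))
```

With k=3 and N=10 this reports 7 tolerated lost shares and an expansion of
about 3.3x.
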
> The "check" and "deep-check" operations will tell you how many shares exist
> for any given file, which is what I think you mean by "state of redundancy".
> The docs/webapi.txt document describes how to perform these operations
> through an HTTP POST command.
> 
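
A hedged sketch of what such a POST could look like from Python. The
"/uri/<cap>" path, the "t=check" / "t=deep-check" query parameter, the node
address and the helper name are all assumptions on my side and should be
checked against docs/webapi.txt for the version in use. The helper only
builds the request; nothing is sent until it is passed to urlopen().

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_check_request(node_url, cap, deep=False):
    """Build (but do not send) a POST asking the node to check a cap.

    ASSUMPTIONS: the URL layout and the 't' parameter are not taken from
    this thread; verify them in docs/webapi.txt.
    """
    operation = "deep-check" if deep else "check"
    url = "%s/uri/%s?%s" % (
        node_url.rstrip("/"),
        quote(cap, safe=""),          # caps contain ':' so escape everything
        urlencode({"t": operation}),
    )
    # An empty body plus method="POST" makes this an HTTP POST.
    return Request(url, data=b"", method="POST")


# Hypothetical node address and a placeholder cap.
req = build_check_request("http://127.0.0.1:3456", "URI:CHK:example")
print(req.get_method(), req.full_url)
```

To actually run the check, pass the request to urllib.request.urlopen() and
read the response body.
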
> > 2/ http://lwn.net/Articles/280483/ explains that files are locally
> > encrypted with AES, then split with some kind of error correction
> > algorithm. Is it possible to not encrypt, and only use Tahoe as a
> > redundant distributed filesystem ?
> 
> No, not currently. We've found that AES is fast enough (something on the
> order of 8MBps/64Mbps) that removing it wouldn't make the system work
> significantly faster, and the security properties are much easier to
> maintain (and the code is much simpler and safer) by making the encryption
> mandatory.
> 
> If you'd like to do some benchmarking with and without AES, I'd love to hear
> about your results. The upload/download process provides many performance
> statistics to show how long different parts of the process took. If you were
> to patch src/allmydata/immutable/upload.py:417 to replace the 'AES' object
> with a dummy version, and again in download.py:51, then you could measure
> the performance without the encryption overhead. Make sure not to mingle files
> created this way with the regular encrypted ones, of course, since by
> changing the algorithm to remove the encryption, you'll break the property
> that a single URI (aka read-cap) refers to exactly one file.
> 
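
For the benchmark, a "dummy version" could be as small as the sketch below.
It assumes the 'AES' object is built from a key and exposes a process(data)
method; both are assumptions to verify against the two lines named above, and
the NullCipher name is hypothetical.

```python
class NullCipher:
    """Hypothetical drop-in for the cipher object that does no encryption.

    ASSUMPTION: the real object is constructed as Cipher(key) and used via
    .process(data); adjust to whatever the patched lines actually call.
    """

    def __init__(self, key):
        # Accept and ignore the key so the call sites need no other change.
        self.key = key

    def process(self, data):
        # Pass the bytes through untouched, removing the encryption cost.
        return data


cipher = NullCipher(b"\x00" * 16)
assert cipher.process(b"plaintext") == b"plaintext"
```

As noted above, files uploaded this way must be kept away from a grid that
holds regularly encrypted files.
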
> > 3/ Did someone benchmark the performance on a LAN ? with CLI and/or
> > fuse ?
> 
> We have automated performance measurements, run both on a LAN (the "in-colo"
> test) and over a home DSL line (the "dsl" test).
> http://allmydata.org/trac/tahoe/wiki/Performance contains some summaries of
> the results and links to the graphs of performance over time, like this one:
> http://allmydata.org/tahoe-figleaf-graph/hanford.allmydata.com-tahoe_speedstats_rate.html .
> 
> We currently upload in-colo at about 1.4MBps/11.3Mbps, and download at about
> 2.3MBps/18.6Mbps . We think that by adding pipelining of segments, we should
> be able to at least double this rate (since from the graph of performance
> over time, you can see that we used to get 4.5MBps down, before we reduced
> the segment size last March).
> 
> These tests are all driven by code inside a Tahoe node. When driven by a
> webapi operation or a CLI command (which mostly uses the webapi interface),
> it will be necessary to transfer the data to/from the node over HTTP, so the
> performance will be slightly lower. We don't have any tests of performance
> through FUSE.
> 
> Other performance numbers of interest include how much latency there is
> (which matters more for small files than large ones), and the performance of
> mutable files (which are used to contain directories). The allmydata.org
> Performance page contains automated test results for these values too.
> 
> > 4/ About the fuse modules found in contrib/:
> >   impl_a and impl_b say that only read is supported,
> >   but http://allmydata.org/~warner/pycon-tahoe.html says of the FUSE plugin:
> >   "allowing them to read and write arbitrary files."
> >   We would be very happy if read and write work under linux :-) (we
> > don't use atime, nor do tricky things on our filesystems)
> 
> Our linux FUSE modules are not very mature yet. My PyCon paper was meant to
> point out that a fully-functional FUSE plugin will allow arbitrary
> applications to access files inside the tahoe virtual filesystem, as opposed
> to the user needing a special FTP-like program to manually move files
> between the tahoe virtual filesystem and their local disk (where other
> applications could work on them).
> 
> The windows FUSE module (which is actually based on the SMB protocol) works
> fairly well, for both read and write, and is the basis for the allmydata.com
> commercial product. Rob Kinninmont is working on a Mac FUSE module, which
> ought to work on linux as well (the MacFUSE folks claim to be source-code
> compatible with Linux/BSD FUSE). I believe his work is both read and write,
> but I'll leave that to him to describe.
> 
> > 5/ Does it scale to TB filesystems ? 
> > 
> > Ideally we would like ~15+ nodes with 500 GB each, and tolerance of
> > 3~4 faulty servers.
> 
> Yes. The allmydata.com commercial grid currently has about 40 nodes, each
> with a single 1TB disk, for a total backend space of 40TB. There is
> currently about 11TB of user data in this grid, at 3-out-of-10 encoding,
> filling about 36TB of backend space. With k=3/N=10, we can lose 7 disks and
> not lose any user data. When a disk fails, we will use the new Repairer (in
> the soon-to-be-released 1.3.0 version) to regenerate the missing shares onto
> a new server.
> 
> From a fault-analysis point of view, the number of files we'd lose if we
> lost 8 disks is small, and grows smaller with the total number of servers.
> This is because the files are pseudo-randomly distributed. I don't have the
> math on me right now, but I believe it is a very small percentage (the
> question to pose to your math major friends is: choose 10 servers out of a
> set of 40, now what are the chances that the 10 you picked include servers
> #1-#8?).
> 
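
The question posed above can be worked out directly. This sketch assumes each
file's 10 shares land on 10 distinct servers chosen uniformly at random (an
idealization of "pseudo-randomly distributed"), so the number of shares on
the failed servers follows a hypergeometric distribution. The function name
and parameters are mine, not Tahoe's.

```python
from math import comb


def loss_probability(servers, shares, needed, lost):
    """Chance that one file is unrecoverable after `lost` servers fail.

    ASSUMPTION: one share per server, servers picked uniformly at random.
    The file is lost when more than (shares - needed) shares sat on the
    failed servers.
    """
    tolerable = shares - needed
    placements = comb(servers, shares)
    fatal = 0
    for on_failed in range(tolerable + 1, min(shares, lost) + 1):
        # Choose which failed servers held shares, then the surviving ones.
        fatal += comb(lost, on_failed) * comb(servers - lost, shares - on_failed)
    return fatal / placements


print(loss_probability(servers=40, shares=10, needed=3, lost=8))
```

Under that assumption the answer for 40 servers, 3-out-of-10 and 8 lost disks
is C(32,2)/C(40,10) = 496/847660528, or roughly one file in 1.7 million.
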
> If you had 15 nodes with 500GB each (so 7.5TB of backend space), and wanted
> to tolerate 4 failures, you could use k=11/N=15, which could accommodate
> 5.5TB of user data.
> 
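
The same sizing can be redone for other grids with a few lines. This is only
a sketch of the arithmetic in the paragraph above, assuming one share per
server (N equal to the number of servers) and equally sized disks; the
function name is hypothetical.

```python
def plan_encoding(servers, disk_tb, failures):
    """Pick k/N for a grid and estimate usable space.

    ASSUMPTION: N equals the server count, so losing a server loses
    exactly one share of each file.
    """
    n = servers
    k = n - failures
    if k < 1:
        raise ValueError("cannot tolerate that many failures")
    backend_tb = servers * disk_tb
    # Only k/N of the backend space holds unique user data.
    usable_tb = backend_tb * k / n
    return k, n, backend_tb, usable_tb


# The 15 x 500GB example: k=11, N=15, 7.5TB backend, 5.5TB usable.
print(plan_encoding(servers=15, disk_tb=0.5, failures=4))
```

Tolerating more failures lowers k and shrinks the usable space in proportion.
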
> > Congrats for this very impressive job. Best regards.
> > Alain Baeckeroot.
> 
> Thanks!
> 
>  -Brian
>