[tahoe-dev] webapps on top of tahoe-lafs: howto - newbie questions

Zooko O'Whielacronx zooko at zooko.com
Sat Apr 9 13:27:08 UTC 2011


Dear elcalamartevigila:

I'm sorry it took so long to reply. The ideas you posted last October
were very interesting. I really hope that you, or someone who comes
after you, moves forward with this whole "web apps on top of
tahoe-lafs" idea.

On Fri, Oct 8, 2010 at 9:45 AM, elcalamartevigila
<elcalamartevigila at gmail.com> wrote:
>
> (first of all, i'd like to apologize for making the following questions about
> concepts for which I don't have a solid grasp. please point me to concepts or
> docs you feel I should review more in-depth before addressing these things)

No need to apologize! We may be able to point you to specific docs in
the course of this conversation.

> I'm very interested in tahoe-lafs, from a "cypherpunk" perspective. I've
> successfully installed and used it in a small grid. I believe it has the design
> features that we were looking for (in terms of resilient, secure, self-managed
> infrastructure).

Yay!

> If I understand it correctly, tahoe-lafs is "only" a *distributed, secure
> filesystem*. By reading this thread [0] I assume that if I want to design any
> kind of "distributed webapp", I would need a layer on top of tahoe-lafs that would
> take care of:
>
> - managing the write- and read-caps. if I don't trust the server in which app is
>  running, I should manage my own node to upload files and be responsible for
>  the management of my own keys.
> - implementing ACLs: "distributing" read-caps to whoever should be able to read
>  them.
>
> (Is this correct so far?)

Sort of! Did you see the webapi?

http://tahoe-lafs.org/trac/tahoe-lafs/browser/trunk/docs/frontends/webapi.rst

It might have exactly what you need.

Perhaps we need a shorter doc to sit in front of webapi, which
describes only the basic functionality and has a more tutorial style.
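
To give a flavour of what such an intro might start with, here is a
minimal sketch of the two most basic webapi operations: uploading a
file and reading it back by its cap. The gateway address is an
assumption (a gateway listening locally on port 3456), and the helper
names are made up for illustration; webapi.rst remains the
authoritative description.

```python
import urllib.request
from urllib.parse import quote

# Assumption: a Tahoe-LAFS gateway is listening locally on this address.
GATEWAY = "http://127.0.0.1:3456"


def upload_request(data: bytes, gateway: str = GATEWAY) -> urllib.request.Request:
    """Build a 'PUT /uri' request; the response body is the new file's cap."""
    return urllib.request.Request(f"{gateway}/uri", data=data, method="PUT")


def download_request(cap: str, gateway: str = GATEWAY) -> urllib.request.Request:
    """Build a 'GET /uri/$CAP' request that returns the file's contents."""
    return urllib.request.Request(
        f"{gateway}/uri/{quote(cap, safe=':')}", method="GET"
    )


def send(request: urllib.request.Request) -> bytes:
    """Send a request to the gateway and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With a gateway running, `cap = send(upload_request(b"hello")).decode()`
followed by `send(download_request(cap))` should hand back the bytes
that were uploaded.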

> So I assume the logical way to have something running "on top" of tahoe would be
> using a traditional database in any framework (looking at the django canopy
> implementation for instance), and delegating the storage of *files* to the grid.
> ie, I upload a file to my traditional app, and it stores it on the
> grid, storing the caps in the database (or on another file on the grid). If I
> want to share my file with a friend, I share the read-cap by any means I can
> think of (using ostatus protocols, or xmpp, for instance).

The result of this architecture is that if you share with your friend
a reference into the app's namespace then your friend doesn't get the
end-to-end integrity and confidentiality guarantees. Instead your
friend relies on the app to serve up an uncorrupted file and not to
share it without authorization. What I mean by "a reference into the
app's namespace" is something like "look in my pictures collection on
my PHP site for April 9", or "GET
$myphpserver/assets/pictures/2011-04-09".

You also have the option to share with your friend a reference into
the Tahoe-LAFS namespace, i.e. "look in my pictures collection in
Tahoe-LAFS for April 9" or "GET
$tahoelafsgateway/uri/$READCAP/pictures/2011-04-09". If you do that
then your friend has two options:

1. Use a Tahoe-LAFS gateway operated by someone else to retrieve the
file. In this case the Tahoe-LAFS gateway will perform the
cryptographic integrity check on the file and then serve the file to
your friend, so just like the "app's namespace" approach above, your
friend relies on that gateway to serve an uncorrupted file and not to
leak it.

2. Use a local Tahoe-LAFS gateway running on their own computer to
retrieve the file. In this case the local gateway will perform the
integrity check, so your friend is not relying on any other person to
give them an uncorrupted file; they are relying solely on the
cryptographic integrity check for that.
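
The difference between the two kinds of reference can be sketched in a
few lines. The server names and the read-cap below are placeholders
invented for illustration, and the function names are hypothetical.
The point is only that the Tahoe-LAFS reference carries the read-cap
itself, so your friend can resolve it through whichever gateway they
choose, including their own.

```python
from urllib.parse import quote


def app_reference(app_server: str, path: str) -> str:
    """A reference into the app's namespace: only the app can resolve it."""
    return f"{app_server.rstrip('/')}/{quote(path)}"


def tahoe_reference(gateway: str, readcap: str, path: str) -> str:
    """A reference into the Tahoe-LAFS namespace: any gateway can resolve it."""
    return f"{gateway.rstrip('/')}/uri/{quote(readcap, safe=':')}/{quote(path)}"


# Placeholder read-cap, not a real one.
READCAP = "URI:DIR2-RO:aaaa:bbbb"

# Option 1: a gateway operated by someone else (hypothetical host name).
remote = tahoe_reference("http://gateway.example.org", READCAP, "pictures/2011-04-09")

# Option 2: the friend's own local gateway (assumed local address).
local = tahoe_reference("http://127.0.0.1:3456", READCAP, "pictures/2011-04-09")
```

Only the gateway part of the URL changes between the two options; the
cap and the path are what you actually share.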

> - how could the rest of the data, ie, all or part of the app's database, be also
>  stored in the grid (I guess the obvious answer would be "serializing and uploading
>  to grid", and then deserializing + syncing on the read side ?) and shared only
>  with selected end-users?

Yes! To my mind this is a more interesting architecture. Basically
just put all the code into Tahoe-LAFS in addition to the data. If you
can do that, then you get the integrity guarantees and fault-tolerance
for the code in addition to the data. That is interesting. However, as
far as I know the only widely deployed programming language
implementation which can easily fetch code over HTTP and execute it is
JavaScript in the modern web browser, so that means you have to write
all your code in JavaScript.

(Also this could probably be done with some of those newfangled "rich
internet application" systems like, um, Adobe Whatever and Microsoft
Whateverelse and SunOracle Alsoran.)

This is like the ideas of "couch apps" [1] that you mentioned, and the
Unhosted project [2]. My blog is the only running instance of this
paradigm that I'm aware of:

http://insecure.tahoe-lafs.org/uri/URI:DIR2-RO:ixqhc4kdbjxc7o65xjnveoewym:5x6lwoxghrd5rxhwunzavft2qygfkt27oj3fbxlq4c6p45z5uneq/blog.html

> - Do I need to come up with an extra communication layer to share the read-caps with
>  "friend" nodes, or could I somehow make use of the underlying DHT?

You can use the DHT. It might be too slow. Try it and see! As above,
webapi.rst ought to explain to you how you can do that, and if it
doesn't then we probably need a webapi_intro.rst to go in front of it.

> I say this because I was a bit confused when I discovered [1][2] that the fs
> layer on tahoe-lafs is in fact built upon a distributed key-value store layer.
> Could this key-value store be used for other purposes than the top fs
> abstraction? (thinking about indexing and querying data chunks).
> I guess the answer might be no, being them non-human-meaningful?

Perhaps the term "distributed key-value store layer" throws more
shadow than light. There are two things that could be meant by it:

1. The way that a Tahoe-LAFS storage client (gateway) can connect to
servers and ask them "Do you have any shares of file XYZ?" or "Would
you please store some shares of file XYZ?".

2. The way that a Tahoe-LAFS wapi client can connect to a Tahoe-LAFS
gateway and ask it "What is the current value of mutable file XYZ?"

The first one is probably useful only for implementing the second one.
Whether the second one can be used for indexing and querying chunks of
data, in a similar way that document-oriented databases or
column-oriented databases can—probably! I'm not sure. Someone should
try it and report back. :-)
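
For anyone who wants to try, here is a rough sketch of the second
meaning: treating mutable files behind the gateway as values in a
small key-value store. Everything here is an assumption made for
illustration: the class and its in-memory name-to-cap table are
hypothetical, the gateway address is assumed, and the `mutable=true`
query argument should be checked against webapi.rst for your version.
The transport is passed in as a function so the logic can be exercised
without a grid.

```python
import urllib.request
from urllib.parse import quote

GATEWAY = "http://127.0.0.1:3456"  # assumption: a local gateway


def send_over_http(request: urllib.request.Request) -> bytes:
    """Real transport: send the request to the gateway, return the body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


class MutableFileStore:
    """Hypothetical key-value view: each name maps to one mutable file."""

    def __init__(self, send=send_over_http, gateway: str = GATEWAY):
        self._send = send
        self._gateway = gateway.rstrip("/")
        self._writecaps = {}  # human-meaningful name -> write-cap

    def _url(self, cap: str) -> str:
        return f"{self._gateway}/uri/{quote(cap, safe=':')}"

    def put(self, name: str, value: bytes) -> None:
        writecap = self._writecaps.get(name)
        if writecap is None:
            # First write: ask the gateway to create a new mutable file.
            request = urllib.request.Request(
                f"{self._gateway}/uri?mutable=true", data=value, method="PUT"
            )
            self._writecaps[name] = self._send(request).decode("ascii").strip()
        else:
            # Later writes: overwrite the existing mutable file in place.
            self._send(
                urllib.request.Request(self._url(writecap), data=value, method="PUT")
            )

    def get(self, name: str) -> bytes:
        """Ask the gateway for the current value of the named mutable file."""
        request = urllib.request.Request(self._url(self._writecaps[name]), method="GET")
        return self._send(request)
```

The name-to-cap table lives only in memory in this sketch; a real
application would need to keep it somewhere durable, for example in a
Tahoe-LAFS directory.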

> I came through these questions thinking again about how something like diaspora could
> be ported to work on top of tahoe, and again, I see some conceptual barriers
> from my limited webdev optic: a key-value or document store is something I can
> readily query and filter on the fly, while a "file" is an abstraction I have to
> write/read as a unit, and process before building any complex app that needs to be able
> to filter/sort data in an efficient way.

Perhaps this is unnecessary baggage from the historical notion of a
"file"? Or perhaps there is some reason that document-oriented
databases really are easier to use for this than Tahoe-LAFS would be.
I don't know. What's the difference between a document in a
document-oriented database and a file in a filesystem?

There is one difference that I'm aware of, which is that in databases
like couchdb you can easily ship JavaScript to the server and have it
process or filter the file on the server, where with Tahoe-LAFS you
would typically ship the contents of the file to the client and have
it do the work there. This seems like "only" a performance issue to me
(I say "only" with scare-quotes because performance is important). I
have a few thoughts about that:

1. We should measure some specific application and see how much it
really matters whether it does the processing on the server or the
client.

2. In some cases the bandwidth and latency from the hard disk to the
locally-attached CPU are worse than the bandwidth and latency from that
CPU to another CPU on the network. (Sometimes your peers on the
network are closer to you, in communications-performance space, than
your hard disk is!)

3. If you *do* decide you want to do that computation on the
server-side instead of the client-side in Tahoe-LAFS, you can do that!
Just run a JavaScript interpreter on the server and give it the URL to
the code and the data. There are still a few performance questions
about erasure-coding and how it should deliver its results, but
there's no reason to think we couldn't get excellent performance from
this architecture if we tried.

4. If there are a lot more clients than servers then doing the
computation on the clients might be a performance win due to the
servers being more heavily loaded.

5. If you are used to the network connection from server to client
being an anemic consumer DSL link, or an around-the-globe internet
connection with dozens of hops, then you may think shipping your code
to the server and running it there with a 100 MB/s link to the hard
drive is great, and you may be in the habit of being "lazy" and
defining your application as doing lots of iteration over large data
sets ("table scans"). Big scans like that can be a performance
limitation even when you are running your code on the server—it means
that your app takes twice as long when it grows to have twice as much
data in that table, and you are saturating the link to the disk, so if
you have 100 MB of data then one user can query it in one second, but
ten users would require ten seconds. On the other hand if you design
your app to query data more precisely then it will perform better both
when executing on the server and when executing on the client.

6. All of the above is about performance optimization. Performance is
important, but you should start by just implementing whatever works so
that you have something to measure. Once you have measurements, then
we can figure out if the performance is acceptable or how to bring it
up to an acceptable level.
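
The arithmetic behind point 5 can be written down as a back-of-the-envelope
model. It assumes the link to the disk is the only bottleneck and
that concurrent scans share it evenly, which is a simplification; the
function name is made up for illustration.

```python
def scan_seconds(data_mb: float, link_mb_per_s: float, concurrent_users: int = 1) -> float:
    """Time for every user to finish a full scan over a shared, saturated link."""
    return data_mb * concurrent_users / link_mb_per_s


# The numbers from point 5: 100 MB of data over a 100 MB/s link.
one_user = scan_seconds(100, 100)          # one user scanning the whole table
ten_users = scan_seconds(100, 100, 10)     # ten users scanning at once
doubled_data = scan_seconds(200, 100)      # the table has grown to twice the size

# A more precise query that touches only 1 MB stays cheap even with ten users.
precise = scan_seconds(1, 100, 10)

print(one_user, ten_users, doubled_data, precise)
```

Under these assumptions the scan time grows linearly with both the
data size and the number of users, while the precise query stays at a
tenth of a second with ten users.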


> besides, the couchapp diaspora port [3] seems very interesting by the couchdb builtin
> features for selective replication. I'd really like to contribute towards seeing
> something similar based on strong crypto and distributedness, but as I explain
> in the lines above, I'm completely lost just by starting to think how to connect
> the html+js frontend to the storage grid, and which should be the role of the database
> in between (hmm something in the lines of what's discussed here [4]... is
> html/js <--> storage grid the only possible answer? )

I don't understand "selective replication". Hopefully now that you
understand more about the Tahoe-LAFS architecture you can explain it
to us. :-)

Regards,

Zooko

[1] http://couchapp.org/page/index
[2] http://www.unhosted.org/


