[tahoe-dev] storage-club URLs

Brian Warner warner at lothar.com
Wed Feb 23 01:10:01 UTC 2011


On 2/22/11 3:38 PM, Greg Troxel wrote:
> 
> I understand your point about how grids might be organized, but I
> don't follow "moving to". The pubgrid is anomalous, and there are
> volunteergrid, volunteergrid2, plus numerous unadvertised grids. So it
> seems like we're already there.

Yeah, I'm talking about making "membership in a grid" more
distinct. We currently have no grid IDs, just the introducer.furl, and
we really want to get rid of that. When there is no introducer to
control things (i.e. its functionality is distributed out among all
members of the grid), we need some other definition of membership, which
means things like:

 1: which servers I, as a client, should trust with my shares
 2: which clients I, as a server, should accept shares from
 3: which gateways I, as a downloader, should get shares from
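
The three relationships above could be captured by something like a
signed roster of member public keys instead of a shared introducer.furl.
This is only a toy sketch under that assumption (the roster layout and
key names are invented, not Tahoe's actual design):

```python
# Hypothetical grid roster: membership defined by public keys rather
# than by knowing the introducer.furl. All names here are made up.
GRID_ROSTER = {
    "servers":  {"pubkey-alice", "pubkey-bob"},   # 1: servers I trust with my shares
    "clients":  {"pubkey-carol", "pubkey-dave"},  # 2: clients I accept shares from
    "gateways": {"pubkey-bob"},                   # 3: gateways I download through
}

def may_store_share(client_pubkey: str) -> bool:
    """Server-side check: only roster members may upload shares."""
    return client_pubkey in GRID_ROSTER["clients"]

assert may_store_share("pubkey-carol")
assert not may_store_share("pubkey-mallory")
```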


>   tahoe invite user at example.com
> 
> which packages up the grid params and sends an OpenPGP signed and
> encrypted mail with the data, then that sounds cool; now you have to
> do that by hand.

Exactly. The "invitation" idea is just that, except hopefully "tahoe
invite" will return a single short "invitation code", and the recipient
will paste it in with "tahoe accept-invite $CODE", and then the two
nodes will find each other and exchange keys and whatnot and eventually
know about each other and everyone transitively connected to them (this
depends, at least initially, upon having a pre-shared broadcast channel,
maybe through a tahoe-lafs.org coordination server).
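
A rough sketch of how such an invitation code might work: both sides
derive the same rendezvous channel name from the short code, post their
node keys there, and then connect directly. The command names come from
the discussion above; the code format and derivation are invented for
illustration:

```python
# Hypothetical "tahoe invite" / "tahoe accept-invite" rendezvous sketch.
# The broadcast-channel mechanism is assumed, not Tahoe's real protocol.
import base64
import hashlib
import os

def make_invite_code() -> str:
    """Generate a short one-time code the inviter hands to the invitee."""
    return base64.b32encode(os.urandom(10)).decode("ascii").lower()

def rendezvous_key(code: str) -> str:
    """Both sides derive the same channel name on the shared broadcast
    server (e.g. a tahoe-lafs.org coordination host) from the code."""
    return hashlib.sha256(("invite:" + code).encode()).hexdigest()[:16]

# Inviter posts its node pubkey under rendezvous_key(code); the invitee
# runs `tahoe accept-invite CODE`, derives the same key, fetches the
# inviter's pubkey, and posts its own. After that the two nodes can
# connect directly and gossip about other grid members.
code = make_invite_code()
assert rendezvous_key(code) == rendezvous_key(code)
```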

> Here I was initially very skeptical, as I have never really understood
> the tahoe community's conflation of storage and publishing. Perhaps
> that's because
> 
>   * I view local storage as essentially free, and reliable and
>     survivable storage as hard.
> 
>   * I associate the "publishing in tahoe" approach with using a
>     particular pubgrid gateway, so it isn't any more reliable than a
>     traditional server

Yeah, that's the thing I want to fix. The big obvious problem with
publishing URLs that start with http://pubgrid.tahoe-lafs.org/ is that
they depend upon that one webapi host, in addition to a quorum of
storage servers, and DNS. Those URLs are *less* reliable than an
ordinary Apache server with a local file in /var/www.

OTOH, a tahoe filecap (if you have the software to use it) is much more
reliable than that URL: you don't need DNS, you run your own gateway,
and you only need a quorum of storage servers to be reachable.

The URLs I'm proposing *could* be more reliable than
http://pubgrid.tahoe-lafs.org/ URLs. On the plus side, there could be
multiple gateways. On the minus side, there's the DNS dispatcher.

>   * I have the impression the publish-in-tahoe activities to date are
>     as much a tahoe marketing activity as they are genuinely useful.

Yeah. The dream of an "unhosted wiki" or "cloudapp" depends upon a
protocol that lets you transparently fail over to alternate servers,
which ordinary browsers can already speak. I think round-robin DNS
records are the closest thing we currently have. Maybe someone will
figure out how to run HTTP over IPv6 anycast addresses or something.
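
The failover that round-robin DNS buys you can be sketched client-side:
a name resolving to several gateway addresses, with the client trying
each in turn until one answers. The hostname here is illustrative:

```python
# Sketch of client-side failover across round-robin DNS A records.
# Any name with multiple A records would do; nothing Tahoe-specific.
import socket

def gateway_addresses(hostname: str, port: int = 80):
    """Return every (address, port) pair the name resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return [info[4] for info in infos]

def connect_any(addresses, timeout=5.0):
    """Try each resolved address until one accepts a TCP connection."""
    for addr in addresses:
        try:
            return socket.create_connection(addr[:2], timeout=timeout)
        except OSError:
            continue
    raise OSError("no gateway reachable")
```

A browser already does roughly this when a name has several A records,
which is why round-robin DNS works without any new client software.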


Anyways, one of my unstated goals is to make it easier for individuals
to publish data on the web. Today, if you have something you want to
share with the world (or even just some friends), how do you do it?
Personally, I'd probably scp it up to www.lothar.com, and hand out the
resulting URL, but I'm lucky/crazy/stupid enough to pay a nontrivial
amount of money each month to rent a box in colo with 24/7 connectivity.
Most people would give it to Facebook and ask them for a URL to give to
people. Or they'd look at the type of the thing being shared and put the
images on flickr, or the spreadsheets on Google Docs, or use one of a
few dozen datatype-specific hosting providers, all of which start by
requiring an account setup process, show you ads or crawl your content
or find other ways to recoup the costs involved in hosting your stuff.
And all of them are susceptible to control by somebody you've never met.

One of the costs of hosting that data is the disk space, but as you
pointed out, local disk is cheap. Another cost is outbound bandwidth,
but everybody connected to the internet has a little bit. Another cost
is having a server available at the same time your potential downloader
wants to see your file, which can be addressed with either a dedicated
machine (maybe a small Guru/Pogo/SheevaPlug), or a collection of
machines that hand off responsibility over the course of the day.

And the biggest cost, particularly for small sites, is organizational:
the tools to manage the server, the easy-to-use upload form, the search
forms, the how-to-delete-my-files forms, the edit-my-pictures forms. You
have to pay this cost before you get a single file online, so
for an individual who just wants to share pictures of their grandkids,
it dwarfs all the others. My grandparents would never build a linux box
and rent colo space to show me their pictures. But they might install an
easy-to-use Tahoe Storage Club(tm) program.

And, as I've mentioned before, I think we can address the limited
bandwidth/uptime of home machines by letting people augment their grids
with rented professional storage. But for small usage, that's hardly a
requirement. And I think there are a lot of people who would like to be
able to host data all by themselves, or collectively with a couple of
friends, and retain control over it (i.e. minimize external
dependencies). A bunch of the solutions depend upon a DNS dispatcher,
but the cost to run one is pretty low, so I think we could fund it with
donations and not wind up with a users-are-product parasitic ad-based
service like almost every dot-com out there.

In summary, I think I'm looking to provide for the low-end of the
publishing spectrum, which is currently expensive enough that most
people are forced to use a "free" service (ironically enough).

> After reading the rest of your message, I think what you're proposing
> makes more sense, and it needs situations with one of two properties:
> 
>   government suppression of publication
> 
>   massive datasets with infrequent access, such that putting them on a
>   regular server is infeasible.
> 
> I think the first use case is more compelling, and then the entire
> system has to be designed against that threat model.
> 
> I wonder if first addressing distributed introducers is in order; it
> would seem that some DHT-type scheme for within-grid discovery might
> also work for the publication scenario, and the tahoe-lafs.org dyndns
> server becomes a single point of failure -- your scheme would not have
> worked in Egypt.

Yeah, I think the attack-tolerance depends upon how much software your
clients are willing to install. If you had IP connectivity but not DNS,
then a big DHT could be used to build an overlay network that gets you
to all the storage servers, and then you can do normal tahoe downloads
over that. A vanilla browser won't be able to take advantage of that,
but if we could make a Firefox plugin that knows how to speak Tahoe and
this DHT, then maybe we could survive stuff like DNS takedowns.
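
A toy illustration of the DHT idea: with DNS gone, a client holding only
a storage index walks a key-value overlay to find the nearest storage
servers. This is a Kademlia-style XOR-distance lookup over an in-memory
table, not any real Tahoe protocol, and the addresses are made up:

```python
# Toy DHT-style server discovery: nothing here touches DNS.
import hashlib

def node_id(name: str) -> int:
    """Map a name (or storage index) onto the 64-bit id space."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big")

# Pretend overlay routing table: node id -> IP address (invented).
OVERLAY = {node_id(f"server{i}"): f"10.0.0.{i}" for i in range(1, 6)}

def closest(target: int, k: int = 2):
    """Return the k overlay node ids XOR-closest to the target id."""
    return sorted(OVERLAY, key=lambda nid: nid ^ target)[:k]

# A downloader locates the servers most likely to hold its shares:
servers = [OVERLAY[nid] for nid in closest(node_id("my-storage-index"))]
```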

I suspect that massive datasets are going to need serious servers.

> There's another thorny problem lurking, which is that storage
> accounting isn't sufficient. I've been talking to people as I head
> towards a private grid, and one person was concerned about network
> usage. Someone here recently expressed concern about a 60G/month usage
> cap, and this is an obvious issue when shopping for a VPS provider. So
> one really needs accounting for total data transfers.

Yeah, good point. Part of the collection-of-gateways scheme needs to
keep track of how much bandwidth is used by people downloading different
files (indexed by the publisher), so the participants can monitor and
control how their machines are being used. Bandwidth is another commons
that needs to be managed, just like storage space. I'll add that to my
notes.
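
A minimal sketch of what that per-publisher accounting might look like
on a gateway, so participants can see (and cap) how their link is being
used. The class and cap-checking behavior are invented; the 60 GB figure
echoes the monthly cap mentioned above:

```python
# Hypothetical per-publisher bandwidth ledger for a gateway node.
from collections import defaultdict

MONTHLY_CAP = 60 * 10**9  # bytes; the 60G/month cap from the thread

class BandwidthLedger:
    def __init__(self):
        self.used = defaultdict(int)  # publisher-id -> bytes served

    def record(self, publisher: str, nbytes: int) -> bool:
        """Charge a download to its publisher; refuse once over cap."""
        if self.used[publisher] + nbytes > MONTHLY_CAP:
            return False
        self.used[publisher] += nbytes
        return True

ledger = BandwidthLedger()
assert ledger.record("alice", 50 * 10**9)      # within the cap
assert not ledger.record("alice", 20 * 10**9)  # would exceed 60 GB
```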

> I wonder if your proposal has somewhat the same properties as a
> regular website as a tor hidden service. Instead, the server is
> distributed and it's not immediately obvious who put the bits there.

Hm, yeah. Like Tor hidden services, access depends upon the
contributions of many people (the various gateways you end up
traversing). Tor runs in "one grid to rule them all" style, with a
number of gateway-reliability monitoring tools that try to apply some
mechanical and/or social pressure to keep the grid working well. I
should read up on what their tools are like and how well they're coping
with the tragedy-of-the-commons effect that large groups and anonymity
usually cause.

thanks!
 -Brian
