[tahoe-dev] dynamic ip

erpo41 at gmail.com
Mon Jun 17 00:44:04 UTC 2013


>>> > where one entity controls all of the nodes
>>>
>>> I think this is mostly not really true and not really important.  There
>>> are some issues which sort of relate, but they all feel independent.
>>
>> There were a few bumps in the road related to node ownership. One of
>> them was the variable number of nodes on the network at any given
>> time. When participants and their nodes join and leave the network
>> freely, that does create churn, but it also creates a change in the
>> number of available nodes over time.
>
> True, but central control and reliable participation are different.
> I still think coping with coming/going is the key issue.

I agree.

>> If the number of nodes in the network drops below N, space is
>> wasted.
>
> I don't really see this as true, but I guess you are arguing that
> redundancy with correlated failures is not really the redundancy you
> think you have.  Still, I think the notion that you need N nodes and
> centralized control are separate issues.

I'm arguing that redundancy with correlated failures is not the
redundancy that I want. In fact, I can't imagine a situation where it
would be desirable to have H < N, except to allow uploads to succeed
when there isn't reliable participation (i.e. H <= nodes online < N).
But, in that case, I think the absence of reliable participation is
the real problem.

>> If the number drops below H, uploads fail. So it seems
>> beneficial to set N and H to lower numbers.  On the other hand, for a
>> given ratio of N to K, larger values overall increase the resilience
>> of the uploads, so it would seem beneficial to set N (and H, and
>> probably K) to higher numbers. Finding the right balance requires
>> knowing in advance how many nodes are going to be available in the
>> long term, and that's hard to do when the nodes are run by people with
>> their own needs, motivations, etc.
>
> That's a fair point.  But I'm not sure how big a deal it is to get
> close to optimal.

Not finding the right parameters would result either in an increased
likelihood of data loss or in upload failures. Either of those is a
big deal to me, and I suspect it is for most other Tahoe
friendnet-type users. I mean, you have to be pretty paranoid if
backing up your data to an external hard drive and taking it to
another building isn't enough.
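
To put rough numbers on it, here's a back-of-the-envelope sketch in
Python (illustrative availability figures only, and it assumes node
failures are independent, which a friendnet won't actually give you):

    # Probability that a K-of-N encoded file is retrievable when each
    # node is independently online with probability p. Real friendnet
    # failures are correlated, so treat this as an optimistic bound.
    from math import comb

    def availability(k, n, p):
        # At least k of the n shares must be reachable.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    for k, n in [(2, 5), (3, 10), (6, 20)]:
        print(f"K={k} N={n} (expansion {n / k:.1f}x): "
              f"p=0.9 -> {availability(k, n, 0.9):.6f}, "
              f"p=0.7 -> {availability(k, n, 0.7):.6f}")

Larger N at the same N/K ratio only buys resilience if that many nodes
actually stay reachable, which is exactly the forecasting problem
above.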

>> I would forecast the reliability of a friendnet somewhere in between
>> an unstable company likely to fold and shut down all its servers at
>> any time and a stable company likely (we hope) to keep operating all
>> of those servers forever. So, I think friendnet configurations can do
>> better in this area.
>
> It's not clear any companies are really stable.  The beauty of a
> friendnet is that with regular deep-check --repair --add-lease, you can
> survive lots of uncorrelated failures.

I did not find this to be true. Setting aside the fact that none of my
uploads ever completed (with the exception of a single, small test
file), other participants reported that running deep-check --repair
--add-lease several times in a row would still report repairs being
needed, when the second and subsequent runs should have come back
clean.
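
(For reference, the invocation in question was along these lines, with
the alias name made up for the example:

    tahoe deep-check --repair --add-lease backups:

A second run immediately afterwards should, in principle, report
nothing left to repair.)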

>>> > each node has a static IP
>>>
>>> My experience has been that the introducer needs to have a static address,
>>> but that storage nodes and clients do not.  Storage nodes do need to
>>> have a globally-routable address, but that's different.
>>
>> I think even the introducer may not technically need a static IP to
>> keep the network going if it has dynamic DNS. However, all nodes need
>> either a static IP or dynamic DNS to find each other (or did at the
>> time I was participating in VG2). That's something else for each node
>> operator to buy or maintain, respectively.
>
> True, the introducer just needs a DNS name.  But I bet there are issues
> related to frequently changing addresses.
>
> I don't understand how storage nodes need to have dynamic DNS.  Each
> node connects to the introducer, and registers an IP address, and others
> connect.

I may be remembering incorrectly, but I think the hostname advertised
to the introducer was set in the configuration file. It may have been
possible to use an IP for that configuration file parameter instead of
a hostname. However, a dynamic IP would then have required rewriting
the configuration file each time the IP changed.
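
If memory serves, the setting in question was the advertised location
in tahoe.cfg, something along these lines (key names and port from
memory, so treat them as approximate):

    [node]
    tub.port = 34510
    tub.location = mynode.example-dyndns.org:34510

With a dynamic IP and no DNS name, that last line would have to be
rewritten (and the node restarted) every time the address changed.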

>> I phrased that badly. I was trying to talk about the amount of data
>> sent from an uploading client node being N/K times the size of the
>> file being uploaded, because upstream bandwidth *should be* (and often
>> is) the limiting factor in Tahoe's performance in a friendnet
>> environment. On the other hand, if I'm uploading to a grid of storage
>> nodes operated by a business that are all interconnected at 100 Mbit,
>> and that business provides an upload helper, uploads over the
>> connection between my node and the helper (the slowest link) won't be
>> multiplied by N/K yet, speeding up the entire process.
>
> True.  I guess I see the N/K overhead as fundamentally built in, and
> don't worry about it.  If Tahoe were to the point that my uplink
> capacity were limiting, that would be great.

The N/K overhead is not fundamentally built in and can be avoided with
an upload helper (when considering the amount of data transferred over
the link between the uploader and the upload helper). And yes, Tahoe
rarely maxed out my upstream connection too.
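
A quick illustration of what I mean (made-up numbers): with K=3 and
N=10, a 300 MB file expands to roughly 1000 MB of shares.

    # Rough arithmetic only: bytes crossing the client's uplink for a
    # K-of-N encoded file, with and without an upload helper.
    def uplink_bytes(file_size, k, n, use_helper):
        if use_helper:
            # The client sends roughly the ciphertext; the helper
            # performs the N/K expansion on its own fast network.
            return file_size
        # Without a helper, the client pushes all N shares itself.
        return file_size * n / k

    size = 300 * 1024 * 1024
    print(uplink_bytes(size, 3, 10, use_helper=False) / 2**20)  # ~1000 MB
    print(uplink_bytes(size, 3, 10, use_helper=True) / 2**20)   # ~300 MB

The expansion still happens either way; it just happens on the far
side of the slowest link.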

>> There are also subtler issues. While I haven't dug very deeply into
>> the code, it was my understanding that at the time of VG2, a Tahoe
>> node processing an upload would divide the file into chunks and
>> upload the chunks serially. That is, it would only begin the upload of
>> chunk 2 to host 2 after the upload of chunk 1 to host 1 was complete.
>>
>> This makes sense when all of the storage nodes and upload helpers are
>> connected together with a fast ethernet switch: an uploading node
>> would saturate its own interface to the switch while sending a single
>> chunk to a single node, requiring no optimization. On the Internet,
>> if the connection between my node and a friend's node is poor, my node
>> is going to leave most of my most precious resource (upstream
>> bandwidth) unused while taking a long time to finish uploading that
>> chunk.
>
> I suspect any serialization is accidental and a result of simple code
> rather than an intentional design decision.

I definitely agree. However, I brought this up because it is an
example of Tahoe's focus on non-friendnet use cases. I originally
wrote that part in response to this line:

> the only problem I face is to "find" a storage node if the tahoe process doesn't
> start or crashes for some reason, since the status page doesn't show the last
> valid IP of offline nodes, just the IP of online nodes

It doesn't show the last valid IP of offline nodes because that
feature wasn't ever considered to be necessary due to the focus on
non-friendnet use cases. I guess what I should have written was: "If
you want to use Tahoe for anything that isn't an allmydata.com-type
scenario, be prepared to get your hands dirty in the code or write
your own support programs before you get the results you want."
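
For what it's worth, the kind of thing I had in mind for the
serialization issue looks roughly like this. It is not Tahoe's code;
send_chunk() is a made-up placeholder, and the point is only that
pushing one chunk per peer concurrently keeps a single slow link from
idling the whole uplink:

    from concurrent.futures import ThreadPoolExecutor

    def send_chunk(host, chunk):
        # Placeholder for whatever actually ships bytes to a storage node.
        pass

    def upload_chunks(assignments):
        # assignments: list of (host, chunk) pairs, one per storage node.
        with ThreadPoolExecutor(max_workers=len(assignments) or 1) as pool:
            futures = [pool.submit(send_chunk, host, chunk)
                       for host, chunk in assignments]
            for f in futures:
                f.result()  # surface any upload errors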

>>>   accounting, so you can have some measure of fairness (even among
>>>   friends who are trying to be reasonable, you need a way to know if
>>>   you've accidentally consumed 10x what you thought you had)
>>>
>>>   expiration/garbage-collection.  There needs to be a way for old shares
>>>   to go away, but it needs to be safe against normal activities, and
>>>   safe against vanishing for a few months.
>>
>> I may be naive here, but I believe both of these problems can be
>> solved by looking to traditional filesystems. Each filesystem object
>> has an owner--that makes it possible and easy to do accounting. Right
>> now objects are somewhat anonymous, which I don't see as an advantage
>> in any of Tahoe's use cases. If you need to distribute data to people
>> anonymously I think a model like Freenet's would provide better
>> protection.
>
> Ah, so you want to have a model where shares have owners (with an owner
> key) and you can enumerate your shares.  That actually would help.  The
> basic reason Tahoe has to be different is that the storage nodes
> cannot read the directory structure, so all tree walking has to be done
> by a (red) client.

I tentatively want this. I understand that the Tahoe devs are smart
and have done a lot of thinking about the security implications of
various features before going ahead with them. I would say I want
shares to have owners unless there is some problem I've managed not to
see.
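
As a strawman for what "shares have owners" could mean on the storage
server side (purely hypothetical, every name here is invented): each
share record carries an owner key, so the server can total up usage
per owner, and an owner can enumerate their own shares, without the
server being able to read any plaintext.

    from collections import defaultdict

    # Hypothetical share records: (storage_index, share_number,
    # owner_key, size_in_bytes). Nothing here exposes plaintext.
    shares = [
        ("si-aaa", 0, "owner-key-A", 1_000_000),
        ("si-aaa", 3, "owner-key-B", 1_000_000),
        ("si-bbb", 1, "owner-key-A", 250_000),
    ]

    usage = defaultdict(int)
    for storage_index, shnum, owner, size in shares:
        usage[owner] += size

    print(dict(usage))  # per-owner accounting totals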

>> The necessity for garbage collection IMHO comes from the fact that
>> it's possible to lose or forget the root of a directory structure. Why
>> not use the Dropbox model, where it's just like another drive with a
>> single root per user?
>
> The problem is that the root cap is cryptographically necessary, and can
> still be lost.  So I don't see getting out of this without losing the
> security properties.

I think storage nodes should be able to read the directory structure.
That's the way ecryptfs does it (the technology underlying encrypted
home directories in Ubuntu), and it seems to work well without
sacrificing privacy.

Thanks,
Eric


