Status update on Tahoe LAFS integration with Nextcloud

Wed Feb 12 20:55:06 UTC 2020

Jean-Paul,

Thanks for the feedback. I will try to isolate the problem.

I also have some additional information which may be useful in determining where the issue is.

When the Tahoe state machine hangs on a node, it is in one of two states:

1. No transfer taking place. When the node is in this state, I can see that there is no data being transferred at the TCP layer. The Tahoe transfer state machine hangs for a period of time until an upper layer timeout occurs.
2. Active transfer taking place. When the node is in this state, I can see that data is being transferred from one node to another at the TCP layer. So TCP is in fact not hung. However, there must be some type of state machine in Tahoe above the TCP layer that breaks the transfer into chunks. The Tahoe transfer state machine, as displayed using “tahoe status”, again does not progress. When Tahoe is in this state, it does not seem to time out. This state is much more problematic since the node can end up transferring many gigabytes of data before the issue is addressed.
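
The two states above could be told apart automatically with something like the following. This is only a minimal sketch: the `Sample` record, the `classify` function, the 60-second window, and the 1024-byte threshold are all hypothetical, and it assumes some external monitor already collects cumulative TCP byte counts and the progress figure shown on the status page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One observation of a node (hypothetical measurements)."""
    t: float          # seconds since monitoring started
    tcp_bytes: int    # cumulative bytes observed at the TCP layer
    progress: float   # operation progress from the status page, 0.0 to 1.0


def classify(samples: List[Sample], window: float = 60.0,
             min_bytes: int = 1024) -> str:
    """Classify the most recent `window` seconds of samples."""
    if not samples:
        return "unknown"
    end = samples[-1]
    recent = [s for s in samples if end.t - s.t <= window]
    if len(recent) < 2:
        return "unknown"          # not enough history to judge
    start = recent[0]
    if end.progress > start.progress:
        return "progressing"      # the state machine is advancing
    moved = end.tcp_bytes - start.tcp_bytes
    # No progress: distinguish a silent hang from one that still moves data.
    return "hung-active" if moved >= min_bytes else "hung-idle"
```

A result of "hung-active" corresponds to the second, more problematic state: bytes keep flowing while the reported progress stays flat.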

I am not sure whether this additional information provides any insight into where I should look for the issue.

                Bruce T

From: Jean-Paul Calderone <jean-paul+tahoe-dev at leastauthority.com>
Date: Wednesday, February 12, 2020 at 9:35 AM
To: Bruce Thompson <brucet.cisco at gmail.com>
Cc: "tahoe-dev at tahoe-lafs.org" <tahoe-dev at tahoe-lafs.org>, eduardo gonzalez <eduardogonzalez at ged-innovations.com>, Brian Thompson <brianbthompson at sbcglobal.net>
Subject: Re: Status update on Tahoe LAFS integration with Nextcloud

On Mon, Feb 10, 2020 at 6:53 PM brucet <brucet.cisco at gmail.com> wrote:

Jean-Paul et al,

I just wanted to give you a status update on the progress I have made on creating a Nextcloud filesystem plugin for Tahoe-LAFS.

The good news is that the plugin is fully functional and seems to work quite well with Tahoe LAFS. The plugin maps either a Tahoe URI or a Tahoe alias / path to a Nextcloud mount point. The plugin also detects Tahoe magic-folder configurations and maps the resulting Tahoe magic-folder directory to a Nextcloud mount point. I have tested the plugin using both the Nextcloud web interface and Nextcloud clients (WebDAV). Both interfaces seem to work reliably (with a caveat).

Heya Bruce,

That's quite cool.

The bad news is that I am having intermittent trouble with Tahoe mutable files, which I believe are used to hold file directories in the Tahoe filesystem. I have multiple Tahoe groups running right now, each with its own introducer. Each Tahoe group has about 8 nodes in it. The nodes which are members of the Tahoe groups are always behind a NAT. I use the frp package (https://github.com/fatedier/frp) to create tunnels to a server with a public IP to get around the NAT issue. The result is that bandwidth between nodes can be reasonably low (< 1 Mbps) and RTT can be quite high.

What I am finding is that when I use the Tahoe filesystem heavily, the Tahoe client state machine will often hang for about 5 – 10 minutes before some type of timeout occurs and data transfers continue. Here is an example of a Tahoe status screen I see when the Tahoe client state machine is hung:

[Screenshot of the Tahoe status screen: image001.png, see the attachment link at the end of this message]

All of the Storage Indexes above are mutable files, which I believe hold Tahoe filesystem directories. When a client hangs, it stays in this state for a significant period of time (5 – 10 minutes) before timing out and continuing on. I have found that once the Tahoe client is in this state, it will not operate properly until I restart the Tahoe node.

I have found multiple nodes in my network which exhibit similar behavior.

I am looking at the Tahoe code right now to try to understand the details of the mutable files subsystem and what may be causing this issue. Any pointers / feedback would be greatly appreciated.

Hm.  I have an intuition that network level timeouts are probably not handled very gracefully throughout Tahoe-LAFS but I can't put my finger on anything in particular.  I don't think I have any concrete ideas about what might be going wrong here, unfortunately.  It sounds like a reliability issue that would be great to resolve.  I wonder if there are any simpler deployment configurations that could be made to replicate the misbehavior.  My approach to tracking down the problem would probably involve setting up an environment where I can easily replicate the behavior and then heavily instrumenting the implementation with improved logging (probably using Eliot) until I had enough information available to make it clear what's happening.
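
As a starting point for that kind of instrumentation, a stall watchdog along the following lines could record when an operation stops making progress. This is a sketch under stated assumptions, not Tahoe-LAFS code: it uses Python's standard `logging` module rather than Eliot, and `StallWatchdog`, `note_progress`, `finish`, `stalled`, and the 300-second threshold are hypothetical names and values chosen for illustration.

```python
import logging
import time
from typing import Callable, Dict, List

log = logging.getLogger("stall-watchdog")


class StallWatchdog:
    """Track the last time each operation reported progress."""

    def __init__(self, threshold: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold    # seconds without progress = stalled
        self.clock = clock            # injectable so tests can fake time
        self._last: Dict[str, float] = {}

    def note_progress(self, op_id: str) -> None:
        """Call whenever an operation completes a step (e.g. a chunk)."""
        self._last[op_id] = self.clock()

    def finish(self, op_id: str) -> None:
        """Stop tracking an operation that completed or was cancelled."""
        self._last.pop(op_id, None)

    def stalled(self) -> List[str]:
        """Return and log the operations idle for at least `threshold`."""
        now = self.clock()
        result = []
        for op_id, seen in self._last.items():
            idle = now - seen
            if idle >= self.threshold:
                log.warning("operation %s: no progress for %.0f s",
                            op_id, idle)
                result.append(op_id)
        return result
```

Calling `stalled()` periodically from a timer would give a timestamped record of which operations stopped advancing, which can then be lined up against TCP-level captures from the same period.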

Jean-Paul

            Bruce T

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 257289 bytes
Desc: not available
URL: <http://tahoe-lafs.org/pipermail/tahoe-dev/attachments/20200212/d3ea1652/attachment.png>