[tahoe-dev] String encoding in tahoe

Tue Dec 23 22:36:25 UTC 2008

Dear François:

What you write sounds reasonable, but I'm not sure precisely how it
would be implemented.  Would we continue to run .decode('utf-8') on
incoming strings, letting an uncaught UnicodeDecodeError stop the
Python interpreter if the input can't be utf-8 decoded?  The only
worry with that is that input in some other encoding could
accidentally form a valid utf-8 byte sequence, so the decode would
appear to succeed but yield a gibberish string instead of the
intended one.
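
To make that worry concrete, here is a minimal sketch, assuming
Python 3 bytes/str semantics and Latin-1 as the "other" encoding.  The
helper name decode_utf8_strict is hypothetical and is not part of
Tahoe.  The sketch shows one input that strict utf-8 decoding rejects
and one that it silently misreads.

```python
def decode_utf8_strict(raw):
    """Decode bytes as UTF-8; raises UnicodeDecodeError on invalid input."""
    return raw.decode("utf-8")

# Case 1: "é" encoded as Latin-1 is the single byte 0xE9.
# That byte alone is not valid UTF-8, so the decode fails loudly.
latin1_bytes = u"\u00e9".encode("latin-1")
try:
    decode_utf8_strict(latin1_bytes)
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Case 2: the two-character string "Ã©" encoded as Latin-1 is the byte
# pair 0xC3 0xA9, which happens to be the valid UTF-8 encoding of "é".
# The decode succeeds, but the result is not what the sender intended.
intended = u"\u00c3\u00a9"
ambiguous_bytes = intended.encode("latin-1")
misread = decode_utf8_strict(ambiguous_bytes)
```

In the second case no exception is raised, so catching
UnicodeDecodeError alone cannot detect the mismatch.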

The other thing that concerns me is that more of the buildbots were  
green before our recent patches.  Does this mean that if we revert  
those patches then the tahoe cli will work with non-ascii filenames  
on Windows?  Or does it mean that the tests were incorrectly marking  
Windows as green last week but actually non-ascii filenames wouldn't  
have worked on Windows?

I need to decide what to do for Tahoe-1.3.0, and what we had last  
week -- where everything passed your tests except for Ubuntu Feisty  
-- seems preferable to what we have today.

If you could tell me precisely what does/doesn't work on what  
platforms, then we could write it down in the known_issues.txt file,  
and we could put SKIP or TODO marks on the appropriate unit tests so  
that the buildbot is green.  I would be very grateful for any help on  
diagnosing and documenting the unicode situation for the 1.3.0  
release.  Allmydata.com doesn't use the cli on Windows currently, so
the company probably isn't going to spend too many resources on that
particular feature.
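
As a rough illustration of what such a mark could look like, here is
a sketch using the standard library's unittest.skipIf.  This is an
assumption for illustration only: the test class, test name, and skip
reason are hypothetical, and a test suite run under Twisted trial
would more likely use trial's own skip/todo attributes instead.

```python
import sys
import unittest

class NonAsciiFilenameTest(unittest.TestCase):
    """Hypothetical test case; not taken from the Tahoe test suite."""

    # Skip on Windows and point at the documented known issue, so the
    # buildbot stays green without hiding the problem.
    @unittest.skipIf(sys.platform == "win32",
                     "non-ascii filenames on Windows: see known_issues.txt")
    def test_utf8_roundtrip(self):
        name = u"\u00e4rtonwall.mp3"
        self.assertEqual(name.encode("utf-8").decode("utf-8"), name)

# Run the test case and collect the outcome in a plain TestResult.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NonAsciiFilenameTest)
result = unittest.TestResult()
suite.run(result)
```

A skipped test is reported separately from passes and failures, so the
skip reason remains visible in the test output.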

Oh, and another idea would be to use sys.setdefaultencoding to make
the default encoding utf-8 instead of ascii.  Would that be a good idea?
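
For reference when weighing that, note that sys.setdefaultencoding is
a Python 2 function which the site module removes from the sys
namespace at startup, and that it does not exist in Python 3.  The
sketch below only reports what the running interpreter uses, which
may help when comparing platforms.  The function name
describe_encodings is hypothetical.

```python
import sys

def describe_encodings():
    """Report the interpreter's encoding settings (hypothetical helper)."""
    return {
        # Encoding used for implicit str/unicode conversions.
        "default": sys.getdefaultencoding(),
        # Encoding used to convert filenames to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # False on Python 3, and on Python 2 once site.py has run.
        "can_set_default": hasattr(sys, "setdefaultencoding"),
    }

info = describe_encodings()
for key in sorted(info):
    print("%s: %s" % (key, info[key]))
```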

Regards,

Zooko