[tahoe-dev] #534: "tahoe cp" command encoding issue

zooko zooko at zooko.com
Sat Feb 28 03:27:03 UTC 2009


Regarding my Strategy 2.d [1], François's Strategy 2.d&1/2 [2], and
Alberto's Strategy 2.e [3], the question is which behavior is more
desirable in the following case: a filename in a local filesystem
isn't actually validly encoded in that filesystem's default codec,
that file gets "tahoe backup"'ed or "tahoe cp"'ed into a tahoe
directory, and *then* an old or lazy tahoe client reads that filename
out of the tahoe directory and gives it to you.  Do you want this old
or lazy tahoe client to give you:

2.d:  Whatever that filename would have been if it had actually been  
encoded in latin-1 in the first place.  (I.e., some sort of  
gibberish, if it wasn't actually latin-1.)

2.d&1/2:  The same as 2.d, but prepended with the U+FFFC character.

2.e:  Whichever characters of that filename *are* legitimate for the
filesystem's default codec, interspersed with U+FFFD "replacement
characters" [4] for any characters that aren't legitimate for the
default codec.
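
To make the three options concrete, here is a rough Python 3 sketch of
what each one would hand back for the same raw filename bytes.  The
function names, the sample filename, and the choice of UTF-8 as the
"default codec" are all assumptions for illustration; none of this is
taken from the tahoe source.

```python
# Hypothetical helpers, one per strategy discussed above.

def strategy_2d(raw: bytes) -> str:
    # 2.d: pretend the bytes were latin-1 all along.  Every byte value
    # maps to exactly one code point, so this never fails.
    return raw.decode("latin-1")


def strategy_2d_half(raw: bytes) -> str:
    # 2.d&1/2: same as 2.d, with U+FFFC prepended as a marker.
    return "\ufffc" + raw.decode("latin-1")


def strategy_2e(raw: bytes, codec: str = "utf-8") -> str:
    # 2.e: keep what decodes cleanly in the default codec (assumed to
    # be UTF-8 here) and substitute U+FFFD for what does not.
    return raw.decode(codec, errors="replace")


# A latin-1 encoded name found on a filesystem whose default codec is
# UTF-8: the byte 0xE9 is not valid UTF-8 in this position.
raw_name = b"caf\xe9.txt"

name_2d = strategy_2d(raw_name)            # 'caf\xe9.txt'
name_2d_half = strategy_2d_half(raw_name)  # '\ufffccaf\xe9.txt'
name_2e = strategy_2e(raw_name)            # 'caf\ufffd.txt'
```

One difference worth noting: the 2.d result can be turned back into the
original bytes with name_2d.encode("latin-1"), whereas the 2.e result
cannot, because each undecodable byte has been collapsed into the same
U+FFFD.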

I tend to think that the first of those three options is the best,
but I would defer to any established "best practices" among unicode
gurus.  Remember that we're only talking about backwards-compatibility
here -- the behavior of old tahoe clients that don't know how to do
anything but treat the "child name" as a unicode string, and of lazy
tahoe clients that don't bother to check for this condition, get the
original bytes, and do "whatever it is that diligent clients are
supposed to do with a bunch of bytes in some unknown encoding".



[1] http://allmydata.org/pipermail/tahoe-dev/2009-February/001343.html
[2] http://allmydata.org/pipermail/tahoe-dev/2009-February/001346.html
[3] http://allmydata.org/pipermail/tahoe-dev/2009-February/001348.html
[4] http://en.wikipedia.org/wiki/Replacement_character
