[tahoe-dev] Unicode issues review

zooko zooko at zooko.com
Tue Feb 17 18:09:09 UTC 2009


On Feb 17, 2009, at 11:03 AM, Jan-Benedict Glaw wrote:

> This happens.
...
> So technically, "encoding" is a per-file property on some  
> filesystems (those that don't care about a filename's contents, as  
> long as it doesn't contain the directory delimiter (typically '/'  
> or '\\') or the '\0' (end of string)).

Ugh.  This would be fine if the filesystem stored and provided the
information about what encoding was used for each name, but I'm
betting those filesystems don't do that.  :-)

So, what should Tahoe do?

1.  Always treat filenames as opaque blobs.  This means Tahoe is  
losing information that some filesystems (e.g. NTFS) provide, and  
making it harder for users on the other side of Tahoe to  
unambiguously decode those filenames.

2.  If the filesystem guarantees a specific encoding, use that one,  
else treat the filename as an opaque blob.

3.  If the filesystem guarantees a specific encoding, use that one,  
else if it provides a "default" encoding, then try to decode with  
that one, and if decoding fails then reject the filename and ask the  
user to fix it up.

3.b.  ... and if decoding fails then treat the filename as an opaque  
blob.

3.c.  ... and if decoding fails then try to decode it with a few  
dozen of our favorite encodings in descending order of popularity ...

4.  Any other options?
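
To make the options above concrete, here is a minimal sketch in Python 3
(an assumption; it is not Tahoe code).  The function name
decode_filename and its parameters are hypothetical, invented only for
illustration.  "guaranteed" stands for an encoding the filesystem
promises, "default" for one it merely suggests, and on_failure selects
between option 3 ("reject"), 3.b ("opaque") and 3.c ("guess").  Option 2
corresponds to passing no default with on_failure="opaque", and option 1
to never calling it at all.

```python
def decode_filename(raw, guaranteed=None, default=None,
                    on_failure="reject", fallbacks=()):
    """Return (name, encoding).

    name is a str if decoding succeeded, otherwise the original bytes
    with encoding None (the "opaque blob" case).
    All names here are hypothetical, for illustration only.
    """
    # Options 2 and 3: the filesystem guarantees an encoding, so use it.
    # A decode error here means the guarantee was broken; let it propagate.
    if guaranteed is not None:
        return raw.decode(guaranteed), guaranteed

    # Option 3: no guarantee, but try the "default" encoding if there is one.
    if default is not None:
        try:
            return raw.decode(default), default
        except UnicodeDecodeError:
            pass

    if on_failure == "reject":
        # Option 3: refuse the name and let the user fix it up.
        raise ValueError("cannot decode filename %r" % (raw,))

    if on_failure == "guess":
        # Option 3.c: try other encodings in descending order of popularity.
        for enc in fallbacks:
            try:
                return raw.decode(enc), enc
            except UnicodeDecodeError:
                continue

    # Options 2 and 3.b (and 3.c when every guess fails): opaque blob.
    return raw, None


# UTF-8 bytes decode cleanly with a UTF-8 default.
print(decode_filename(b"caf\xc3\xa9", default="utf-8"))
# Latin-1 bytes are not valid UTF-8, so 3.b keeps them as bytes.
print(decode_filename(b"caf\xe9", default="utf-8", on_failure="opaque"))
```

One caveat with 3.c: single-byte encodings such as Latin-1 accept every
possible byte string, so once one of them appears in the fallback list
the guess always "succeeds", whether or not it is what the user meant.
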

Thanks!

Zooko


