[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue

Zooko O'Whielacronx zooko at zooko.com
Sun May 3 19:25:51 UTC 2009


On May 3, 2009, at 11:48 AM, Shawn Willden wrote:

> On Sunday 03 May 2009 09:14:28 am tahoe-lafs wrote:
>>  2. On Linux or Solaris read the filename with the string APIs, and
>>  store the result in the "original_bytes" part of the metadata. Call
>>  sys.getfilesystemencoding() to get an alleged_encoding. Then, call
>>  bytes.decode(alleged_encoding, 'strict') to try to get a unicode
>>  object.
>
> Why not just read the filename with the unicode API?  That will
> decode it using the file system encoding if possible, and if that
> decoding fails you'll get a string object as a result, with the
> original bytes.  Then you only have to bother with the
> "original_bytes", "failed_decode", etc. if the file name is a
> string, rather than a unicode object.
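
(Concretely, under Python 2 on a POSIX system that looks roughly like
the sketch below; the directory path is only an example:)

    import os

    # Passing a unicode path makes os.listdir() decode each entry with the
    # filesystem encoding; entries that cannot be decoded come back as byte
    # strings (str) instead of unicode objects.
    for name in os.listdir(u'/some/dir'):
        if isinstance(name, unicode):
            print "decoded:  ", repr(name)
        else:
            print "raw bytes:", repr(name)  # decoding failed; original bytes kept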

It's because on Linux or Solaris we want to bother with
"original_bytes" even when the filenames decode correctly, whereas on
Windows or Mac we don't.  (For what it's worth, Python 3.0 changes
this behavior to omit non-decodable filenames entirely, and Python 3.1
is slated to change it again to emit invalid unicode for them
instead...)
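
In code, the idea for Linux and Solaris is roughly the following
Python 2 sketch (the helper name and the "unicode_name" key are just
for illustration, not the exact metadata layout):

    import sys

    def filename_metadata(byte_name):
        # byte_name is the raw byte string read from the filesystem.
        metadata = {'original_bytes': byte_name}
        # getfilesystemencoding() can return None; falling back to
        # 'ascii' here is only a guess, not part of the proposal.
        alleged_encoding = sys.getfilesystemencoding() or 'ascii'
        try:
            # strict decoding: an undecodable byte raises instead of
            # being silently replaced
            metadata['unicode_name'] = byte_name.decode(alleged_encoding, 'strict')
            metadata['failed_decode'] = False
        except UnicodeDecodeError:
            metadata['failed_decode'] = True
        return metadata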

> If there is value in using the string API and then decoding in
> strict mode, then your approach makes perfect sense, except that I'd
> still prefer to handle it the same way on all platforms, rather than
> special-casing.  Reading with the string API and then strictly
> decoding with the file system encoding should work just fine on
> Windows, too.

I think there is value in this only on byte-oriented filesystems such
as those on Linux and Solaris.  Also, I don't think the Python API
even allows one to reliably get the bytes from Windows filesystems; in
fact, I'm not sure the Windows API itself allows that.

Regards,

Zooko


