[tahoe-dev] #534: "tahoe cp" command encoding issue
alberto at metapensiero.it
Sat Feb 28 02:55:23 UTC 2009
>>>>> "Zooko" == zooko <zooko at zooko.com> writes:
Zooko> As I understand it from Shawn and Kevin, taking an arbitrary
Zooko> byte string and decoding it with latin-1 to produce a unicode
Zooko> object is lossless -- a subsequent encode of that unicode
Zooko> object with latin-1 will always yield the same bytes. Is
Zooko> that right?
I think so; I see no reason why the mapping shouldn't be biunivocal
(one-to-one).
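A quick way to check the round-trip claim, as a minimal sketch in Python 3 (the thread itself predates it and uses the Python 2 str.decode idiom); the variable names are only for illustration:

```python
# Under latin-1 each of the 256 possible byte values maps to exactly one
# code point (U+0000..U+00FF), so decoding cannot fail and encoding the
# result gives back the same bytes.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_tripped = as_text.encode("latin-1")

assert round_tripped == all_bytes
assert [ord(ch) for ch in as_text] == list(range(256))
```

The check covers every single byte value, and since latin-1 decodes byte by byte, it carries over to arbitrary byte strings.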
Zooko> So I propose Strategy 2.d (but who's counting?):
Zooko> Decode the filename with the declared encoding. If that
Zooko> succeeds, then put that unicode string (utf-8 encoded) into
Zooko> the child name and set the flag "latin_1_fallback: False".
Zooko> If that fails then decode the filename with latin-1 (which
Zooko> can't fail) then put that unicode string (utf-8 encoded) into
Zooko> the child name and set the flag "latin_1_fallback: True".
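For reference, the quoted strategy 2.d could be sketched roughly as follows. This is Python 3, the function name child_name_2d and the returned (name, metadata) shape are hypothetical and not taken from the tahoe code, and only the "latin_1_fallback" flag comes from the proposal itself:

```python
def child_name_2d(raw_name: bytes, declared_encoding: str):
    """Hypothetical helper: return (utf-8 child name, metadata) per 2.d."""
    try:
        text = raw_name.decode(declared_encoding)
        fallback = False
    except UnicodeDecodeError:
        # latin-1 accepts every byte value, so this decode cannot fail.
        text = raw_name.decode("latin-1")
        fallback = True
    return text.encode("utf-8"), {"latin_1_fallback": fallback}


# A valid utf-8 name and a latin-1 name that is invalid as utf-8.
good = child_name_2d(b"caf\xc3\xa9", "utf-8")
bad = child_name_2d(b"caf\xe9", "utf-8")
```

Both calls yield the same child name here, and only the flag tells a reader that the second one has to be re-encoded with latin-1 to get the original bytes back.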
I mostly agree with this strategy, but I propose to change it slightly
in this way (strategy 2.e):
Decode the filename with the declared encoding. If that succeeds, put
the utf-8 string into the child name and do nothing more. If the
first decoding fails, decode it with filename.decode('utf-8', 'replace'),
which replaces invalid byte sequences with the standard U+FFFD unicode
char, and put the result into the child name. Then decode the original
filename again with latin1 and place that string in a "latin1_decoded"
entry in the metadata.
Old tahoe clients will behave as you explained in your 2.d.
Lazy or generic clients will behave similarly, but will be able to
tell which characters were erroneous from the U+FFFD substitutions.
New and diligent tahoe clients will check for the presence of the
"latin1_decoded" entry in the metadata and, in that case, use its value
to reproduce the original filename bytes using the latin1 codec.
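A minimal sketch of strategy 2.e, covering both the writing side and a diligent reader. It is Python 3, and encode_2e, original_bytes_2e and the plain dict used for metadata are hypothetical names for illustration, not the tahoe API; only the "latin1_decoded" key comes from the proposal:

```python
def encode_2e(raw_name: bytes, declared_encoding: str):
    """Hypothetical writer: return (utf-8 child name, metadata) per 2.e."""
    try:
        # Normal case: nothing extra is recorded.
        return raw_name.decode(declared_encoding).encode("utf-8"), {}
    except UnicodeDecodeError:
        # Visible name: undecodable bytes become U+FFFD.
        visible = raw_name.decode("utf-8", "replace")
        # Lossless copy of the original bytes, decoded as latin-1.
        metadata = {"latin1_decoded": raw_name.decode("latin-1")}
        return visible.encode("utf-8"), metadata


def original_bytes_2e(child_name: bytes, metadata: dict, declared_encoding: str):
    """Hypothetical diligent reader: recover the original filename bytes."""
    if "latin1_decoded" in metadata:
        return metadata["latin1_decoded"].encode("latin-1")
    return child_name.decode("utf-8").encode(declared_encoding)


raw = b"caf\xe9.txt"  # latin-1 bytes, invalid as utf-8
name, meta = encode_2e(raw, "utf-8")
recovered = original_bytes_2e(name, meta, "utf-8")
```

A client that ignores the metadata still gets a valid utf-8 child name with the bad byte shown as U+FFFD, while one that reads "latin1_decoded" gets the exact original bytes.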