[tahoe-dev] [tahoe-lafs] #534: "tahoe cp" command encoding issue

Alberto Berti alberto at metapensiero.it
Fri Feb 27 01:24:52 UTC 2009


>>>>> "Andrej" == Andrej Falout <andrej at falout.org> writes:
    Andrej> 1) Files are on the file system, which thinks they have
    Andrej> valid file names. 

    Andrej> 2) Samba, Windows and CD-ROM/DVD file systems think the same

    Andrej> 3) I can backup this files as they are successfully,
    Andrej> using many different backup/Sync tools, including tar,
    Andrej> Dropbox, MS Mesh, SyncToy, Unison, rsync 

    Andrej> 4) I can even back them up using 'tahoe cp' and Tahoe
    Andrej> Windows backup client, and did so several times (I was using
    Andrej> this same fileset to test performance)

    Andrej> So I wonder, how come I need to "fix" them to use 'tahoe
    Andrej> backup' only?

    Andrej> "ls" and nautilus/konqueror displays gibberish in this
    Andrej> particular file, yes, but when mounted on Windows via Samba,
    Andrej> file name is displayed correctly.

    Andrej> I was hoping that the patch had the intention of allowing
    Andrej> all valid encoded file names to be backed up?

I truly understand your points. I just want to comment on the examples
you made and share a bit of my own limited experience.  About your
examples, it appears either that the filesystem is rather permissive in
what it allows a filename to be, or that in some cases the
"application level" fails to give the filename the same meaning it had
when the file was first named.
I've had encoding problems with samba myself, when upgrading servers or
transferring filesystems from an older server to a new one. It's true
that there is a long history of problems with encoding/decoding of
path names on samba's mailing lists, so this problem is or was a big
issue even for them :)
Most of the time I had to set the "unix charset" option in smb.conf to
see the names as expected again on clients.

It seems to me that we are discussing two issues in one here: how
tahoe dirs should store filenames, and what the right behavior for the
backup command should be.

Thinking about tahoe and the fact that two of its goals are
preservation (for the future) and sharing, my opinion is that
the best solution is to store filenames in a uniform, unicode-aware
charset. I don't think that a more permissive form of storage will help
in the long run.
From another point of view, a backup tool is expected to do its best
to restore the previous state (and meaningfulness) of files and
directories, so I don't think it should stop when backing up weird
filenames that can't be decoded to unicode without errors. Instead it
should save the filename in the unicode-aware charset by replacing or
escaping the offending bytes, and maybe add a note somewhere containing
the bytestring as read from the filesystem. Where, I still don't know.
Maybe it's worth remembering that json directory metadata is encoded
to utf-8 on the wire as well.
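
To make the "escape and keep a note" idea a bit more concrete, here is a
minimal sketch in modern Python. It is not tahoe code: the function name
describe_filename, the default of utf-8 as the filesystem encoding, and
the choice of base64 for the note are all assumptions made only for
illustration. Where such a note would actually be stored is still the
open question.

```python
import base64


def describe_filename(raw, fs_encoding="utf-8"):
    """Turn a filename read from the filesystem (bytes) into unicode.

    Hypothetical helper, for illustration only. Returns a tuple
    (name, note): `name` is always a unicode string that can be encoded
    to utf-8, and `note` is None when the bytes decoded cleanly, or the
    base64 of the original bytestring when some bytes had to be escaped.
    """
    try:
        # Normal case: the name is valid in the assumed fs encoding.
        return raw.decode(fs_encoding), None
    except UnicodeDecodeError:
        # Weird name: escape each undecodable byte as \xNN instead of
        # aborting, and keep the untouched bytes so that a restore could
        # recreate the original name exactly.
        escaped = raw.decode(fs_encoding, errors="backslashreplace")
        note = base64.b64encode(raw).decode("ascii")
        return escaped, note
```

With this approach a latin-1 name such as b"caf\xe9.txt" would not stop
the backup: it would be stored under an escaped unicode name, and the
note would carry the exact bytes seen on the filesystem.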

my two cents.

Alberto



