[tahoe-dev] GSOC proposal---"100 year cryptography"

Fri Apr 16 15:56:37 UTC 2010

Dear Yu Xue:

Very good proposal! Here are a couple of comments.

2010/4/16 yu xue <xueyu7452 at gmail.com>:

> The cryptographic algorithm will be increasingly weaker because of the
> constant cryptanalysis.

Your summary of the problem and definition of a combiner was excellent.

> This part is mainly for symmetric ciphers such as block cipher and stream
> cipher. Take two ciphers as underlying primitives. For block cipher it need
> to specify an appropriate operation mode, such as CBC, CRT, or OFB etc.

Let's use CTR for a block cipher (i.e. AES) and combine with a stream
cipher (XSalsa20).

> Generate independent keys for each cipher.

This part is interesting. How do you generate independent keys? I
investigated HKDF and discussed it on the Crypto Forum Research Group
mailing list. Eventually I decided that HKDF was not solving the same
problem and that what we really want is simply: let subkey1 be the
first few bytes of the keystream of XSalsa20(masterkey) and let
subkey2 be the next few bytes of the same keystream.

> combiners mentioned that we can implement. The basic one is Comb4P which can
> preserve four properties ( CR, MAC, TCR, PRF) but cannot preserve IRO. There
> are many crypto schemes that are proved secure in the Random Oracle Model.
> So the indifferentiability property is important. If C is indifferentiable
> from a random oracle then C can replace the random oracle in any
> cryptosystem.

I had thought that Comb4P was good enough: relatively simple,
relatively CPU-efficient, and had the admirable property of output
size only 2N.

While I appreciate IRO for its theoretical importance, I had been
thinking that I don't want to pay a big cost in code complexity and
output size to have IRO in real life. For one thing, Tahoe-LAFS
probably requires only collision-resistance from its hash function.
(Although I wouldn't be surprised if people end up relying on
Tahoe-LAFS capabilities to be random-looking, too.)

I don't even understand how to implement the more sophisticated
combiners in [1], such as Comb4P&OW, Comb4P&IRO, or Comb6P.

For that matter, perhaps the ideal combiner for Tahoe-LAFS's purposes
is the dumb old concatenation combiner!

However, if you think the more sophisticated combiners are valuable
and you know how to implement them, perhaps we could make exploring
those options be part of the "above and beyond" work for the summer
project.

The basic summer project goal would be a thoroughly well-implemented,
tested, documented, measured, packaged implementation of the simple
stream cipher combiner and the Comb4P hash function combiner, and then
the "above and beyond" work would be to implement and measure the
concatenation combiner and the other combiners from Lehmann's thesis.

Note that I never like to offer alternatives to programmers or users,
so ultimately Tahoe-LAFS v2 will support only one way to hash things.
That one ultimate hash function in Tahoe-LAFS v2 will have to be
parallelizable (i.e. a tree hash) in order not to prevent future
implementations from parallelizing, it will have to be efficient in
CPU in order not to waste users's time or battery power too much, it
will have to be memory efficient in order not to preclude tiny cheap
embedded implementations, and it will of course have to be as strong
as we can make it to minimize concerns about long-term safety. On top
of all this it needs to be as simple as possible to specify and to
implement so that we won't have as many bugs and it will be easier for
us and others to reimplement.

I would also offer this same hash function to other programmers who
want to use it in projects other than Tahoe-LAFS, possibly by making
it a supported API in the pycryptopp library.

Anyway I'm not 100% sure at this time which combiner should be the
basis of this ultimate hash function. I would lean toward Comp4P as a
good balance between simplicity, efficiency, and strong security
properties. However, actually implementing and measuring some of the
other ones would be an interesting exercise. It would give us an
understanding of how complex their implementations are, for example,
as well as letting us measure their concrete performance. In
particular, the performance within the context of a tree-hash mode of
operation is potentially different than the performance in the
traditional context of a (potentially long) stream hash. The hash
function which is used by the tree hash will probably never be asked
to hash more than 512 KiB at a time, but it will probably often be
asked to hash 128 bytes or 256 bytes at a time.

> 3. all kinds of test harness, unit tests, test vectors
>
> About 2-3 weeks
>
> Do unit tests and test vectors. Write test harness to exercise the new
> components as much as possible. Thoroughly test the components. The unit
> tests need to be done when each feature is complemented.

Let's do this in the "test first" paradigm. The rule of test-first is:
you're not allowed to write a line of code until you have a fully
automated test case which runs and fails. Then you write code until
you can get that test case to pass, at which point you have to stop
writing code and go back to adding tests until you have some more test
cases that fail.

So move the 2-3 weeks of writing tests to the beginning of the period,
and hopefully we'll also interleave the writing of tests with the
writing of the code-under-test so that there won't be separate weeks
or even days in which we do only one and not the other. In addition to
unit tests and test vectors, let's also build other automated
evaluation tools in from the start: code-coverage metrics, bug hunting
tools such as valgrind and Adam Langely's sweet ctgrind hack [2],
benchmarks (not that we're going to focus on optimizing the benchmark
performance -- just that we want to have a measurement), and a
buildbot farm so that we can run all of these automated tools on
multiple machines and aggregate their results.

That may sound intimidating but much of the infrastructure is already
in place! The pycryptopp project already has a buildbot farm [3].
(Granted, it needs some tending on the part of me, Brian Warner, and a
few buildbot owners, but it is there and it is working.) The current
code in pycryptopp already comes with unit tests and test vectors, it
automatically runs the unit tests under valgrind on platforms where
valgrind is available. So starting from this it shouldn't be too much
work to add more tools such as code coverage measurements, ctgrind,
benchmarks, etc.. I'm willing to do the code coverage metrics and the
buildbot farming personally. Hopefully someone else will volunteer to
write benchmarks.

> 4. documentation and specification
>
> About 1 week
>
> Write related documentation and specification for afterwards maintenance and
> modification etc. Including source codes, design rational, structure of
> components etc.

Likewise, let's move this up to the beginning. You should write the
documentation concerning the specification and the API before writing
the code. (This means you'll end up having to change the docs as you
change the API.) You should write the documentation concerning the
design of the code as you write the code itself. For one thing, this
will enable others who are following along to iteratively contribute
to the specification and design, to work on alternative
implementations (such as a pure-Python or pure-JavaScript
implementation) and so on.

Hey, I'm excited about this project! :-)

Regards,

Zooko

[1] http://tuprints.ulb.tu-darmstadt.de/2094/1/thesis.lehmann.pdf
[2] http://github.com/agl/ctgrind/
[3] http://allmydata.org/buildbot-pycryptopp/waterfall