[tahoe-dev] (no subject)

Jack Lloyd lloyd at randombit.net
Mon Feb 2 15:56:21 UTC 2009

Reed-Solomon encoding using Vandermonde matrices (as used in Rizzo's
fec and in zfec) requires many multiplications in GF(2^8). Profiling
with valgrind showed about 90% of the total runtime was in the
addmul() function.

Multiplication in GF(2^8) can be done using a double-and-add algorithm
(it works in other fields as well, such as Z/pZ, where we call this
algorithm the right to left binary modular exponentation). But the
general thought is that this is slower than table lookups. This is
certainly true in the case where the function is computed a byte at a
time, but I realized the operations used could be mapped to SSE2
operations, allowing 16 bytes to be processed in parallel.

I did a short writeup about my results
and Zooko asked if it would be possible to include the SSE2 code
(which I tested as being about twice as fast as a table lookup on a
Core2) in zfec.

It turned out to be very easy to include the code in zfec (not too
surprising as they do share a common codebase, though both have
changed internally in major ways). For a simple benchmark to test that
everything was OK, I wrote code that would read a file, in 4K chunks
at a time, and encode it FEC parameters (3,10). I'm not sure if 4K
is really that optimal here, it was chosen mostly because that is the
chunk size filefec.py uses.

I ran the encoders on a 2.4 GHz Core2 (Gentoo Linux), using as input a
512 MiB file generated by /dev/urandom. The file was pulled into cache
by md5sum'ing it before I ran the encoder. I ran each test several
times andd got quite consistent numbers.

In C++, SSE2 was much faster than a 64K table:

        table      sse2
real    5.021s     2.376s
user    4.797s     2.202s
sys     0.197s     0.174s

But when I moved the SSE2 into zfec and did the encoding via Python, I
was surprised. Not only is overall runtime almost 3 times as long, the
SSE2 version was 25% slower:

        table      sse2
real    14.435s    18.159s
user    14.205s    17.925s
sys     0.229s     0.227s

In an attempt to narrow down the problem, I wrote a program to access
zfec's C API. Which has runtimes very close to the C++ code:

        table      sse2
real    5.004s     2.323s
user    4.837s     2.163s
sys     0.165s     0.159s

So, apparently the problem is somewhere between fec.c and Python, but
where? In attempt to narrow things down, I wrapped my C++ FEC code
using Boost.Python:

        table      sse2
real    13.735s    17.295s
user    13.516s    17.074s
sys     0.217s     0.217s

which is oddly consistent with the numbers using zfec (which directly
uses the Python C API instead of Boost.Python, and uses alloca()
instead of std::vector, and is generally completely different in terms
of style). I also checked simply reading the file from Python, without
encoding anything: that took less than a quarter of a second, mostly
system time.

So, why is it that using SSE2 is faster than a table lookup in C/C++,
but 25% slower from Python? Zooko suggested cache contention, which
seems plausible. Using the SSE2 version removes the need for 64
kibibytes worth of multiplication tables, if anything leaving more
room for input (and the Python interpreter data) in the L1/L2 data
cache. However the instruction size of the SSE2 version is much
larger; it may be fitting in the L1 I-cache with C/C++ (no runtimes to
speak of), but causing blowouts when it has to compete with the Python

I tried using oprofile, but about half an hour into trying things out,
I got:

Filesystem "sdc1": XFS internal error xfs_trans_cancel at line 1163 of file fs/xfs/xfs_trans.c.  Caller 0xffffffff803444a0
Filesystem "sdc1": Corruption of in-memory data detected.  Shutting down filesystem: sdc1

Which is the first time I've seen anything like this in several years
of using XFS. Fortunately this was just the scratch partition, and
seemed unharmed upon reboot, but for the moment this has scared me off
from using oprofile further. (Maybe I can run another Linux instance
inside KVM and oprofile zfec from there?)

So I will carry on using valgrind+kcachegrind and blind guessing. One
thing to try, to test the I-cache blowout theory, would be de-rolling
the loop from 64 bytes at a time to 16 or 32. If the problem is the
instruction cache, that will help - though it may end up making it
slower than just using table lookups.

I've attached a patch for zfec that adds SSE2, as well as the encoders
I used in C and Python. This patch does seem worth using if one is
using zfec from C, though my impression is that most people access
zfec via the Python module. I have not tested to see which method
faster from Haskell; those results might give another hint.

-------------- next part --------------
--- old-trunk/zfec/zfec/fec.c	2009-02-02 09:31:33.677104495 -0500
+++ new-trunk/zfec/zfec/fec.c	2009-02-02 09:31:33.678104549 -0500
@@ -8,6 +8,11 @@
 #include <stdlib.h>
 #include <string.h>
 #include <assert.h>
+#include <stdint.h>
+#if defined(__SSE2__)
+  #include <emmintrin.h>
  * Primitive polynomials - see Lin & Costello, Appendix A,
@@ -169,6 +174,101 @@
 #define UNROLL 16               /* 1, 4, 8, 16 */
 static void
 _addmul1(register gf*restrict dst, const register gf*restrict src, gf c, size_t sz) {
+#if defined(__SSE2__)
+    GF_MULC0 (c);
+    while(sz && (uintptr_t)dst % 16) // first align dst to 16 bytes
+      {
+      GF_ADDMULC(dst[0], src[0]);
+      ++dst;
+      ++src;
+      sz--;
+      }
+   const __m128i polynomial = _mm_set1_epi8(0x1D);
+   const __m128i all_zeros = _mm_setzero_si128();
+   const size_t y_bits = 32 - __builtin_clz(c);
+   // unrolled out to cache line size
+   while(sz >= 64)
+      {
+      __m128i x_1 = _mm_loadu_si128((const __m128i*)(src));
+      __m128i x_2 = _mm_loadu_si128((const __m128i*)(src + 16));
+      __m128i x_3 = _mm_loadu_si128((const __m128i*)(src + 32));
+      __m128i x_4 = _mm_loadu_si128((const __m128i*)(src + 48));
+      __m128i z_1 = _mm_load_si128((const __m128i*)(dst));
+      __m128i z_2 = _mm_load_si128((const __m128i*)(dst + 16));
+      __m128i z_3 = _mm_load_si128((const __m128i*)(dst + 32));
+      __m128i z_4 = _mm_load_si128((const __m128i*)(dst + 48));
+      // prefetch next two src and dst blocks
+      _mm_prefetch(src + 64, _MM_HINT_T0);
+      _mm_prefetch(dst + 64, _MM_HINT_T0);
+      _mm_prefetch(src + 128, _MM_HINT_T1);
+      _mm_prefetch(dst + 128, _MM_HINT_T1);
+      if(c & 0x01)
+         {
+         z_1 = _mm_xor_si128(z_1, x_1);
+         z_2 = _mm_xor_si128(z_2, x_2);
+         z_3 = _mm_xor_si128(z_3, x_3);
+         z_4 = _mm_xor_si128(z_4, x_4);
+         }
+      for(size_t j = 1; j != y_bits; ++j)
+         {
+         /*
+         * Each byte of each mask is either 0 or the polynomial 0x1D,
+         * depending on if the high bit of x_i is set or not.
+         */
+         __m128i mask_1 = _mm_cmpgt_epi8(all_zeros, x_1);
+         __m128i mask_2 = _mm_cmpgt_epi8(all_zeros, x_2);
+         __m128i mask_3 = _mm_cmpgt_epi8(all_zeros, x_3);
+         __m128i mask_4 = _mm_cmpgt_epi8(all_zeros, x_4);
+         x_1 = _mm_add_epi8(x_1, x_1);
+         x_2 = _mm_add_epi8(x_2, x_2);
+         x_3 = _mm_add_epi8(x_3, x_3);
+         x_4 = _mm_add_epi8(x_4, x_4);
+         mask_1 = _mm_and_si128(mask_1, polynomial);
+         mask_2 = _mm_and_si128(mask_2, polynomial);
+         mask_3 = _mm_and_si128(mask_3, polynomial);
+         mask_4 = _mm_and_si128(mask_4, polynomial);
+         x_1 = _mm_xor_si128(x_1, mask_1);
+         x_2 = _mm_xor_si128(x_2, mask_2);
+         x_3 = _mm_xor_si128(x_3, mask_3);
+         x_4 = _mm_xor_si128(x_4, mask_4);
+         if((c >> j) & 1)
+            {
+            z_1 = _mm_xor_si128(z_1, x_1);
+            z_2 = _mm_xor_si128(z_2, x_2);
+            z_3 = _mm_xor_si128(z_3, x_3);
+            z_4 = _mm_xor_si128(z_4, x_4);
+            }
+         }
+      _mm_store_si128((__m128i*)(dst     ), z_1);
+      _mm_store_si128((__m128i*)(dst + 16), z_2);
+      _mm_store_si128((__m128i*)(dst + 32), z_3);
+      _mm_store_si128((__m128i*)(dst + 48), z_4);
+      src += 64;
+      dst += 64;
+      sz -= 64;
+      }
+   for(size_t i = 0; i != sz; ++i)
+      GF_ADDMULC(dst[i], src[i]);
     const gf* lim = &dst[sz - UNROLL + 1];
@@ -201,6 +301,7 @@
     lim += UNROLL - 1;
     for (; dst < lim; dst++, src++)       /* final components */
         GF_ADDMULC (*dst, *src);

-------------- next part --------------
A non-text attachment was scrubbed...
Name: time_encoding.py
Type: text/x-python
Size: 380 bytes
Desc: not available
URL: <http://tahoe-lafs.org/pipermail/tahoe-dev/attachments/20090202/93be23a3/attachment.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: time_encoding.c
Type: text/x-csrc
Size: 859 bytes
Desc: not available
URL: <http://tahoe-lafs.org/pipermail/tahoe-dev/attachments/20090202/93be23a3/attachment.c>

More information about the tahoe-dev mailing list