Tuesday, February 20, 2007

Quantifying cache bounces

Not many people seem to realize just how expensive is memory sharing and cache bounce in parallel systems. The following code is intended as a simple demonstration/benchmark of the bounce costs.

The main thread spawns NTHREADS worker threads. Each worker thread proceeds to independently increment a distinct memory location, NLOOPS times. In the first variant, the memory locations are in different cache lines. In the second variant (obtained by compiling with -DBOUNCE), the memory locations are adjiacent and in the same cache line.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define NLOOPS 200000000

static void *
worker (void *_cnt)
{
int i;
volatile int *cnt = _cnt;

*cnt = 0;

for (i = 0; i < NLOOPS; ++i)
(*cnt)++;

return 0;
}

static int cnt [NTHREADS][128];

int
main ()
{
int i, sum;
pthread_t t [NTHREADS];

for (i = 0; i < NTHREADS; ++i)
{
#ifdef BOUNCE
pthread_create (&t [i], 0, worker, &cnt [0][i]);
#else
pthread_create (&t [i], 0, worker, &cnt [i][0]);
#endif
}

sum = 0;
for (i = 0; i < NTHREADS; ++i)
{
pthread_join (t [i], 0);
#ifdef BOUNCE
sum += cnt [0][i];
#else
sum += cnt [i][0];
#endif
}

printf ("%d\n", sum);
return 0;
}

I ran the benchmark on GNU/Linux, kernel 2.6.15, 4x (2x dual-core) Opteron 2.2Mhz (Sun Fire X4200) and obtained the following results:

$ gcc -O3 cache-test.c -lpthread
$ time ./a.out
800000000

real 0m0.648s
user 0m2.396s
sys 0m0.000s
$ time ./a.out
800000000

real 0m0.615s
user 0m2.384s
sys 0m0.004s
$ time ./a.out
800000000

real 0m0.622s
user 0m2.436s
sys 0m0.004s
$ gcc -O3 -DBOUNCE cache-test.c -lpthread
$ time ./a.out
800000000

real 0m9.991s
user 0m29.374s
sys 0m0.008s
$ time ./a.out
800000000

real 0m9.957s
user 0m29.310s
sys 0m0.016s
$ time ./a.out
800000000

real 0m9.974s
user 0m29.330s
sys 0m0.012s
$

Taking the minimum of each measurement gives 9.957/0.615 == 16.9, in other words the cache bounces cause close to 1700% slowdown.