Tuesday, February 20, 2007

Quantifying cache bounces

Not many people seem to realize just how expensive is memory sharing and cache bounce in parallel systems. The following code is intended as a simple demonstration/benchmark of the bounce costs.

The main thread spawns NTHREADS worker threads. Each worker thread proceeds to independently increment a distinct memory location, NLOOPS times. In the first variant, the memory locations are in different cache lines. In the second variant (obtained by compiling with -DBOUNCE), the memory locations are adjiacent and in the same cache line.

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define NLOOPS 200000000

static void *
worker (void *_cnt)
{
int i;
volatile int *cnt = _cnt;

*cnt = 0;

for (i = 0; i < NLOOPS; ++i)
(*cnt)++;

return 0;
}

static int cnt [NTHREADS][128];

int
main ()
{
int i, sum;
pthread_t t [NTHREADS];

for (i = 0; i < NTHREADS; ++i)
{
#ifdef BOUNCE
pthread_create (&t [i], 0, worker, &cnt [0][i]);
#else
pthread_create (&t [i], 0, worker, &cnt [i][0]);
#endif
}

sum = 0;
for (i = 0; i < NTHREADS; ++i)
{
pthread_join (t [i], 0);
#ifdef BOUNCE
sum += cnt [0][i];
#else
sum += cnt [i][0];
#endif
}

printf ("%d\n", sum);
return 0;
}

I ran the benchmark on GNU/Linux, kernel 2.6.15, 4x (2x dual-core) Opteron 2.2Mhz (Sun Fire X4200) and obtained the following results:

$ gcc -O3 cache-test.c -lpthread
$ time ./a.out
800000000

real 0m0.648s
user 0m2.396s
sys 0m0.000s
$ time ./a.out
800000000

real 0m0.615s
user 0m2.384s
sys 0m0.004s
$ time ./a.out
800000000

real 0m0.622s
user 0m2.436s
sys 0m0.004s
$ gcc -O3 -DBOUNCE cache-test.c -lpthread
$ time ./a.out
800000000

real 0m9.991s
user 0m29.374s
sys 0m0.008s
$ time ./a.out
800000000

real 0m9.957s
user 0m29.310s
sys 0m0.016s
$ time ./a.out
800000000

real 0m9.974s
user 0m29.330s
sys 0m0.012s
$

Taking the minimum of each measurement gives 9.957/0.615 == 16.9, in other words the cache bounces cause close to 1700% slowdown.

3 comments:

Seun Osewa said...

The lesson is Clear: let each processor work on a separate piece of data at a time.

Thanks for sharing this. Hope you post more in the future.

Nd said...

I couldn't resist trying this on my machine, which is an Intel (Core 2):

% gcc -o nobounce -O3 cache_test.c -lpthread
% gcc -DBOUNCE -o bounce -O3 cache_test.c -lpthread
% for i in {1..3}; time ./nobounce
800000000
./nobounce 2.23s user 0.00s system 186% cpu 1.194 total
800000000
./nobounce 2.14s user 0.01s system 186% cpu 1.155 total
800000000
./nobounce 2.15s user 0.00s system 184% cpu 1.168 total
% for i in {1..3}; time ./bounce
800000000
./bounce 2.64s user 0.01s system 191% cpu 1.382 total
800000000
./bounce 2.65s user 0.00s system 182% cpu 1.450 total
800000000
./bounce 2.68s user 0.00s system 184% cpu 1.455 total
% for i in {1..3}; time ./nobounce
800000000
./nobounce 2.17s user 0.00s system 188% cpu 1.151 total
800000000
./nobounce 2.16s user 0.00s system 182% cpu 1.184 total
800000000
./nobounce 2.13s user 0.01s system 185% cpu 1.155 total

A little slower, but not noticeably. Something hardware-specific, perhaps?

Loz said...

@nd Core 2' L1 cache is 8-way set associative. That means you need more than 8 threads to see the phenomenon. You might also need to tweak the 128, to ensure each thread is actually competing for the same cache lines. Opteron is only 2-way set associative, so 4 threads is plenty to get the cache bouncing.
I guess the reason why it slows down a little is the same reason a hash table with 4 objects in one bucket is slower than one with 4 buckets each containing one object.