The main thread spawns NTHREADS worker threads. Each worker thread proceeds to independently increment a distinct memory location, NLOOPS times. In the first variant, the memory locations are in different cache lines. In the second variant (obtained by compiling with -DBOUNCE), the memory locations are adjiacent and in the same cache line.
#include <stdio.h>
#include <pthread.h>
#define NTHREADS 4
#define NLOOPS 200000000
static void *
worker (void *_cnt)
{
int i;
volatile int *cnt = _cnt;
*cnt = 0;
for (i = 0; i < NLOOPS; ++i)
(*cnt)++;
return 0;
}
static int cnt [NTHREADS][128];
int
main ()
{
int i, sum;
pthread_t t [NTHREADS];
for (i = 0; i < NTHREADS; ++i)
{
#ifdef BOUNCE
pthread_create (&t [i], 0, worker, &cnt [0][i]);
#else
pthread_create (&t [i], 0, worker, &cnt [i][0]);
#endif
}
sum = 0;
for (i = 0; i < NTHREADS; ++i)
{
pthread_join (t [i], 0);
#ifdef BOUNCE
sum += cnt [0][i];
#else
sum += cnt [i][0];
#endif
}
printf ("%d\n", sum);
return 0;
}
I ran the benchmark on GNU/Linux, kernel 2.6.15, 4x (2x dual-core) Opteron 2.2Mhz (Sun Fire X4200) and obtained the following results:
$ gcc -O3 cache-test.c -lpthread
$ time ./a.out
800000000
real 0m0.648s
user 0m2.396s
sys 0m0.000s
$ time ./a.out
800000000
real 0m0.615s
user 0m2.384s
sys 0m0.004s
$ time ./a.out
800000000
real 0m0.622s
user 0m2.436s
sys 0m0.004s
$ gcc -O3 -DBOUNCE cache-test.c -lpthread
$ time ./a.out
800000000
real 0m9.991s
user 0m29.374s
sys 0m0.008s
$ time ./a.out
800000000
real 0m9.957s
user 0m29.310s
sys 0m0.016s
$ time ./a.out
800000000
real 0m9.974s
user 0m29.330s
sys 0m0.012s
$
Taking the minimum of each measurement gives 9.957/0.615 == 16.9, in other words the cache bounces cause close to 1700% slowdown.
3 comments:
The lesson is Clear: let each processor work on a separate piece of data at a time.
Thanks for sharing this. Hope you post more in the future.
I couldn't resist trying this on my machine, which is an Intel (Core 2):
% gcc -o nobounce -O3 cache_test.c -lpthread
% gcc -DBOUNCE -o bounce -O3 cache_test.c -lpthread
% for i in {1..3}; time ./nobounce
800000000
./nobounce 2.23s user 0.00s system 186% cpu 1.194 total
800000000
./nobounce 2.14s user 0.01s system 186% cpu 1.155 total
800000000
./nobounce 2.15s user 0.00s system 184% cpu 1.168 total
% for i in {1..3}; time ./bounce
800000000
./bounce 2.64s user 0.01s system 191% cpu 1.382 total
800000000
./bounce 2.65s user 0.00s system 182% cpu 1.450 total
800000000
./bounce 2.68s user 0.00s system 184% cpu 1.455 total
% for i in {1..3}; time ./nobounce
800000000
./nobounce 2.17s user 0.00s system 188% cpu 1.151 total
800000000
./nobounce 2.16s user 0.00s system 182% cpu 1.184 total
800000000
./nobounce 2.13s user 0.01s system 185% cpu 1.155 total
A little slower, but not noticeably. Something hardware-specific, perhaps?
@nd Core 2' L1 cache is 8-way set associative. That means you need more than 8 threads to see the phenomenon. You might also need to tweak the 128, to ensure each thread is actually competing for the same cache lines. Opteron is only 2-way set associative, so 4 threads is plenty to get the cache bouncing.
I guess the reason why it slows down a little is the same reason a hash table with 4 objects in one bucket is slower than one with 4 buckets each containing one object.
Post a Comment