How fast can we compute the dot product?Eric Bainville - Dec 2007
In these pages, we will see how fast we can compute a dot product between two vectors x, y of dimension n. The dot product is defined as <x,y> = x1y1 + x2y2 + ... + xnyn.
Each component can be either a 32-bit (float) or 64-bit (double) floating point number. We focus on the special case where both vectors are stored in consecutive memory locations (the BLAS incx parameter is 1 for both vectors). As in the other pages on multiprecision, our goal is to find the fastest inner-loop pattern, corresponding to the best speed for "large" vectors stored in the L1 cache. In our case, the L1 cache is 32 KB, so we take a maximal size of n=2048 for float and n=1024 for double, using 16 KB for two vectors (leaving some cache free). We will assume vectors to start at addresses multiple of 64 bytes.
All measurements are made on a Core 2 Duo E6600 (Conroe core, 266 MHz x9, DDR2 memory 266 MHz x3), and only one thread is used. C code is compiled with gcc 4.1.2 64 bits, -march=nocona -O2. The benchmark code can provide a precise value (1% error) for the asymptotic number of cycles per vector component. When unrolling the loops, we will assume the vector size to be a multiple of 4 or 8 or 16 as needed (this does not change the asymptotic time).
Here are the timings in CPU cycles per component (Atlas BLAS 3.8.0, Intel MKL 10.0.1.014), and the timings of the code discussed in these pages. Different FSB and memory/cpu speed settings can lead to slower results.
|Dot product code||float||double|
|C code, trivial||3.00||3.00|
|C code, unroll 2x||2.00||2.00|
|C code, unroll 4x||2.00||2.00|
|SSE code (this page)||0.50||1.00|
|Intel MKL 10.0.1.014||0.50||1.00|
|SSE gradient||Top of Page||SSE dot product : C implementation|