# How fast can we compute the dot product?

Eric Bainville - Dec 2007

In these pages, we will see how fast we can compute a dot product
between two vectors x, y of dimension n. The dot product is
defined as <x,y> = x_1 y_1 + x_2 y_2 + ... + x_n y_n.

Each component can be either a 32-bit (float) or 64-bit
(double) floating point number. We focus on the special case where
both vectors are stored in consecutive memory locations (the BLAS
incx parameter is 1 for both vectors). As in the other pages on
multiprecision, our goal is to find the **fastest inner-loop
pattern**, corresponding to the best speed for "large" vectors
stored in the L1 cache. In our case, the L1 cache is 32 KB, so we
take a maximal size of n=2048 for float and n=1024 for double,
using 16 KB for the two vectors (leaving some cache free). We assume
both vectors start at addresses that are multiples of 64 bytes.

All measurements are made on a Core 2 Duo E6600 (Conroe core, 266 MHz x9, DDR2 memory at 266 MHz x3), using a single thread. C code is compiled with 64-bit gcc 4.1.2, -march=nocona -O2. The benchmark code provides a precise value (1% error) for the asymptotic number of cycles per vector component. When unrolling the loops, we assume the vector size is a multiple of 4, 8, or 16 as needed (this does not change the asymptotic time).

Here are the timings in CPU cycles per component (Atlas BLAS 3.8.0, Intel MKL 10.0.1.014), and the timings of the code discussed in these pages. Different FSB and memory/cpu speed settings can lead to slower results.

| Dot product code | float | double |
|---|---|---|
| C code, trivial | 3.00 | 3.00 |
| C code, unroll 2x | 2.00 | 2.00 |
| C code, unroll 4x | 2.00 | 2.00 |
| SSE code (this page) | 0.50 | 1.00 |
| Atlas 3.8.0 | 0.50 | 1.00 |
| Intel MKL 10.0.1.014 | 0.50 | 1.00 |
