# How fast can we compute the dot product?

Eric Bainville - Dec 2007

## SSE code

Let's consider first the case of float (32-bit) coordinates. We can read 4 components per cycle using movaps. Once we have two blocks of 4 components in two SSE registers, we multiply them together using mulps, and then accumulate the product into the result using addps. The result will be a 4-component vector. After the end of the loop we compute the final sum by adding together the 4 components of this vector.

From Agner Fog's excellent documents, we know the scheduling constraints of the instructions we use:
- movaps on port P2, result after 2 cycles, one call per cycle,
- mulps on port P0, result after 4 cycles, one call per cycle,
- addps on port P1, result after 3 cycles, one call per cycle,
- lea on port P0,
- jnz on port P5.

Based on these values, we should be limited by memory access. Reading 4 components for each vector requires at least 2 cycles, giving a theoretical time of 0.50 cycles per component.

Let's try the simplest pattern:

```	shr	N, 2
pxor	S0, S0
align   16
.a:
movaps	xmm0, [X]
mulps	xmm0, [Y]
lea	X, [X + 16]
lea	Y, [Y + 16]
dec	N
jnz	.a

; Compute result: sum of the 4 components of S0
```

It runs at 1.00 cycles/component. Unrolling the loop, with the sum in two different registers S0 and S1, we get 0.66 cycles/component with the following pattern:

```	movaps	xmm0, [X]
mulps	xmm0, [Y]
movaps	xmm1, [X + 16]
mulps	xmm1, [Y + 16]
lea	X, [X + 32]
lea	Y, [Y + 32]
```

We can reach 0.63 cycles/component by permuting the instructions:

```	movaps	xmm0, [X]
movaps	xmm1, [X + 16]
mulps	xmm0, [Y]
mulps	xmm1, [Y + 16]
lea	X, [X + 32]
lea	Y, [Y + 32]
```

The minimal timing of 0.50 cycles/component requires further unrolling (16 components per iteration) and "pipelined" computations to help the processor schedule the instructions. In this case, the first and last iterations are slighty different, and can't be part of the main loop. Note the cyclic pattern of register numbering, and only two sum registers are needed (S0 and S1). This is the pattern of the main loop:

```	movaps	xmm0, [X]
mulps	xmm3, [Y - 16]

movaps	xmm1, [X + 16]
mulps	xmm0, [Y]

movaps	xmm2, [X + 32]
mulps	xmm1, [Y + 16]

movaps	xmm3, [X + 48]
mulps	xmm2, [Y + 32]

lea	X, [X + 64]
lea	Y, [Y + 64]
```

In double precision, the same pattern (with -pd functions instead of -ps) reaches the optimal time of 1.00 cycles/component.