How fast can we compute the dot product?Eric Bainville - Dec 2007
Let's consider first the case of float (32-bit) coordinates. We can read 4 components per cycle using movaps. Once we have two blocks of 4 components in two SSE registers, we multiply them together using mulps, and then accumulate the product into the result using addps. The result will be a 4-component vector. After the end of the loop we compute the final sum by adding together the 4 components of this vector.
From Agner Fog's excellent documents,
we know the scheduling constraints of the instructions we use:
- movaps on port P2, result after 2 cycles, one call per cycle,
- mulps on port P0, result after 4 cycles, one call per cycle,
- addps on port P1, result after 3 cycles, one call per cycle,
- lea on port P0,
- jnz on port P5.
Based on these values, we should be limited by memory access. Reading 4 components for each vector requires at least 2 cycles, giving a theoretical time of 0.50 cycles per component.
Let's try the simplest pattern:
shr N, 2 pxor S0, S0 align 16 .a: movaps xmm0, [X] mulps xmm0, [Y] addps S0, xmm0 lea X, [X + 16] lea Y, [Y + 16] dec N jnz .a ; Compute result: sum of the 4 components of S0
It runs at 1.00 cycles/component. Unrolling the loop, with the sum in two different registers S0 and S1, we get 0.66 cycles/component with the following pattern:
movaps xmm0, [X] mulps xmm0, [Y] addps S0, xmm0 movaps xmm1, [X + 16] mulps xmm1, [Y + 16] addps S1, xmm1 lea X, [X + 32] lea Y, [Y + 32]
We can reach 0.63 cycles/component by permuting the instructions:
movaps xmm0, [X] movaps xmm1, [X + 16] mulps xmm0, [Y] mulps xmm1, [Y + 16] addps S0, xmm0 addps S1, xmm1 lea X, [X + 32] lea Y, [Y + 32]
The minimal timing of 0.50 cycles/component requires further unrolling (16 components per iteration) and "pipelined" computations to help the processor schedule the instructions. In this case, the first and last iterations are slighty different, and can't be part of the main loop. Note the cyclic pattern of register numbering, and only two sum registers are needed (S0 and S1). This is the pattern of the main loop:
movaps xmm0, [X] mulps xmm3, [Y - 16] addps S0, xmm2 movaps xmm1, [X + 16] mulps xmm0, [Y] addps S1, xmm3 movaps xmm2, [X + 32] mulps xmm1, [Y + 16] addps S0, xmm0 movaps xmm3, [X + 48] mulps xmm2, [Y + 32] addps S1, xmm1 lea X, [X + 64] lea Y, [Y + 64]
In double precision, the same pattern (with -pd functions instead of -ps) reaches the optimal time of 1.00 cycles/component.
|SSE dot product : C implementation||Top of Page||AMD64 Multiprecision|