AMD64 Multiprecision ArithmeticEric Bainville - Dec 2006
This function sets to 0 all words of a vector (Z,N).
A simple implementation, setting one word at each iteration of the loop, would be:
1 word per iteration: xor ZERO, ZERO ; ZERO is any 64-bit register lea Z, [Z + 8*N] ; Loop from the end neg N ; Loop body should start on a 16 byte boundary align 16 .a: mov [Z + 8*N], ZERO inc N jnz .a
Here ZERO, is an auxiliary register, and N, Z are the two input registers. The execution time here is 2 cycles/word. The limitation comes from the jnz: there must be at least two cycles between two taken jumps.
The CPU provides 3 integer evaluation units, each one with an integer ALU and an AGU (address generation unit), but only two 64-bit memory operations (read or write) can occur at the same time. So we can clear at most 2 words per cycle: the minimal timing imposed by the memory access is 0.5 cycles/word.
Since the loop body must be at least 2 cycles, to reach the optimum, we must clear at least 4 words per iteration, as in the following:
4 words per iteration: shr N, 2 ; One iteration is 4 words xor ZERO, ZERO align 16 .a: mov [Z ], ZERO mov [Z + 8], ZERO mov [Z + 16], ZERO mov [Z + 24], ZERO lea Z, [Z + 32] dec N jnz .a
One iteration requires 3 cycles, or 0.65 cycles/word. The lea updating Z may introduce a delay in the address computation. Moving it between the 2nd and 3rd mov, or replacing it with an add does not change the timing. We have to unroll the loop further, and use two alternating address registers Z and X to reach the optimal timing:
8 words per iteration, two address registers: shr N, 3 xor ZERO, ZERO align 16 .a: mov [Z ], ZERO mov [Z + 8], ZERO lea X, [Z + 32] mov [Z + 16], ZERO mov [Z + 24], ZERO dec N mov [X ], ZERO mov [X + 8], ZERO lea Z, [X + 32] mov [X + 16], ZERO mov [X + 24], ZERO jnz .a
One iteration now takes 4 cycles, for an optimal timing of 0.50 cycles/word.
|AMD64 Multiprecision : Introduction||Top of Page||AMD64 Multiprecision : Unary OP|