Intel64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
Unary OP
This function applies unary operator OP (neg, not) to all words of a vector (Z,n).
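As a point of reference for what the assembly versions compute, here is a minimal C sketch of the operation. The function names `vec_not` and `vec_neg` are hypothetical and do not come from the original code. Note that, as in the loops on this page, `neg` is applied to each 64-bit word independently, with no carry between words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: apply "not" to every 64-bit word of (z, n). */
void vec_not(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        z[i] = ~z[i];
}

/* Hypothetical reference: apply "neg" to every 64-bit word of (z, n).
   Each word is negated on its own (two's complement per word);
   nothing is propagated from one word to the next. */
void vec_neg(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        z[i] = (uint64_t)0 - z[i];
}
```

A compiler may unroll or vectorize these loops by itself; the sketch only pins down the expected result, not the performance.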
Conroe
Using 64-bit general purpose registers, we can reach the maximum 64-bit memory bandwidth after unrolling the trivial load+op+store loop to 4 words/iteration:
    shr     N, 2
    align   16
.a:
    mov     AUX, [Z]
    OP      AUX
    mov     [Z], AUX
    mov     AUX, [Z + 8]
    OP      AUX
    mov     [Z + 8], AUX
    lea     Z, [Z + 32]
    mov     AUX, [Z - 16]
    OP      AUX
    mov     [Z - 16], AUX
    mov     AUX, [Z - 8]
    OP      AUX
    mov     [Z - 8], AUX
    dec     N
    jnz     .a
This runs in 4 cycles/iteration, or 1.00 cycle/word. This can be further reduced using 128-bit XMM registers. Independently of other limiting factors, the memory bandwidth corresponds to 0.50 cycles/word.
The following code is an optimal version of the SSE2 variant for the case OP=not, running at 0.50 cycles/word:
    ; Load 0xFFFFFFFF.FFFFFFFF.FFFFFFFF.FFFFFFFF to xmm4
    xor     AUX, AUX
    not     AUX
    movd    xmm4, AUX
    punpcklqdq xmm4, xmm4
    shr     N, 3
    movdqa  xmm0, xmm4
    align   16
.a:
    pxor    xmm0, [Z]
    movdqa  xmm1, xmm4
    movdqa  [Z], xmm0
    pxor    xmm1, [Z + 16]
    movdqa  xmm0, xmm4
    movdqa  [Z + 16], xmm1
    pxor    xmm0, [Z + 32]
    movdqa  xmm1, xmm4
    movdqa  [Z + 32], xmm0
    lea     Z, [Z + 64]
    pxor    xmm1, [Z - 16]
    movdqa  xmm0, xmm4
    movdqa  [Z - 16], xmm1
    dec     N
    jnz     .a
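The same idea, computing NOT as an XOR with an all-ones register and handling 8 words (four 128-bit blocks) per iteration, can be sketched in C with SSE2 intrinsics. This is an illustration under stated assumptions, not the author's code: the name `vec_not_sse2` is hypothetical, the vector is assumed to be 16-byte aligned (as `movdqa` requires), and n is assumed to be a multiple of 8 (matching `shr N, 3`). The instruction scheduling of the hand-written loop is left to the compiler, so the 0.50 cycles/word figure should not be expected to carry over automatically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: complement n 64-bit words starting at z.
   Assumptions: z is 16-byte aligned, n is a multiple of 8. */
void vec_not_sse2(uint64_t *z, size_t n)
{
    assert(((uintptr_t)z & 15) == 0);
    assert(n % 8 == 0);
#if defined(__SSE2__)
    /* All bits set, playing the role of xmm4 in the assembly. */
    const __m128i ones = _mm_set1_epi32(-1);
    __m128i *p = (__m128i *)z;
    for (size_t i = 0; i < n / 8; i++, p += 4) {
        /* x XOR all-ones == NOT x, on four aligned 128-bit blocks. */
        _mm_store_si128(p + 0, _mm_xor_si128(_mm_load_si128(p + 0), ones));
        _mm_store_si128(p + 1, _mm_xor_si128(_mm_load_si128(p + 1), ones));
        _mm_store_si128(p + 2, _mm_xor_si128(_mm_load_si128(p + 2), ones));
        _mm_store_si128(p + 3, _mm_xor_si128(_mm_load_si128(p + 3), ones));
    }
#else
    /* Portable fallback when SSE2 is not available at compile time. */
    for (size_t i = 0; i < n; i++)
        z[i] = ~z[i];
#endif
}
```

Unlike the assembly loop, which counts down with `dec N` / `jnz` and therefore expects at least one full iteration, this sketch simply does nothing when n is 0.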