Intel64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
Unary OP
This function applies unary operator OP (neg, not) to all words of a vector (Z,n).
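As a plain reference for what the function computes, here is a minimal C sketch. The names vec_not and vec_neg are hypothetical and do not come from the article. Each word is processed independently, as in the assembly loops, so no carry is propagated between words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference versions of "Unary OP" on a vector (Z,n).
   The names are illustrative only; they are not from the article. */

/* OP = not: bitwise complement of each 64-bit word. */
void vec_not(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        z[i] = ~z[i];
}

/* OP = neg: two's complement negation of each 64-bit word,
   word by word, with no carry between words. */
void vec_neg(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        z[i] = (uint64_t)0 - z[i];
}
```

These loops leave unrolling and vectorization to the compiler. The hand-written assembly in this section makes those choices explicitly.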
Conroe
Using 64-bit general purpose registers, we can reach the maximum 64-bit memory bandwidth after unrolling the trivial load+op+store loop to 4 words/iteration:
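    shr N, 2                ; N = number of iterations, 4 words each
    align 16
.a:
    mov AUX, [Z]            ; word 0: load, apply OP, store
    OP AUX
    mov [Z], AUX
    mov AUX, [Z + 8]        ; word 1
    OP AUX
    mov [Z + 8], AUX
    lea Z, [Z + 32]         ; advance Z by 4 words
    mov AUX, [Z - 16]       ; word 2, addressed from the new Z
    OP AUX
    mov [Z - 16], AUX
    mov AUX, [Z - 8]        ; word 3
    OP AUX
    mov [Z - 8], AUX
    dec N
    jnz .a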
This runs in 4 cycles/iteration, or 1.00 cycle/word. It can be reduced further using 128-bit XMM registers: each 128-bit access moves two 64-bit words, so, independently of other limiting factors, the memory bandwidth limit becomes 0.50 cycles/word.
The following code is an optimal version of the SSE2 variant for the case OP=not, running at 0.50 cycles/word:
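    ; Load 0xFFFFFFFF.FFFFFFFF.FFFFFFFF.FFFFFFFF to xmm4
    xor AUX, AUX
    not AUX                 ; AUX = all ones
    movd xmm4, AUX          ; low quadword of xmm4 = all ones
    punpcklqdq xmm4, xmm4   ; copy the low quadword to the high quadword
    shr N, 3                ; N = number of iterations, 8 words each
    movdqa xmm0, xmm4       ; prime xmm0 with the mask for the first pxor
    align 16
.a:
    pxor xmm0, [Z]          ; xmm0 = mask xor words 0-1, i.e. their complement
    movdqa xmm1, xmm4       ; reload the mask into the other register
    movdqa [Z], xmm0
    pxor xmm1, [Z + 16]     ; words 2-3
    movdqa xmm0, xmm4
    movdqa [Z + 16], xmm1
    pxor xmm0, [Z + 32]     ; words 4-5
    movdqa xmm1, xmm4
    movdqa [Z + 32], xmm0
    lea Z, [Z + 64]         ; advance Z by 8 words
    pxor xmm1, [Z - 16]     ; words 6-7, addressed from the new Z
    movdqa xmm0, xmm4       ; prime xmm0 for the next iteration
    movdqa [Z - 16], xmm1
    dec N
    jnz .a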
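The same idea, complementing two words at a time by XOR with an all-ones mask, can be sketched in C with SSE2 intrinsics. This is an illustration under stated assumptions, not the author's code. The name vec_not_sse2 is hypothetical, Z is assumed to be 16-byte aligned (as the aligned loads and stores in the assembly require), and n is assumed to be even. A scalar fallback is used when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: complement n 64-bit words in place.
   Assumptions for the SSE2 path: z is 16-byte aligned, n is even. */
void vec_not_sse2(uint64_t *z, size_t n)
{
#if defined(__SSE2__)
    assert(((uintptr_t)z & 15) == 0);   /* aligned load/store below */
    assert(n % 2 == 0);                 /* two words per 128-bit access */
    const __m128i ones = _mm_set1_epi32(-1);   /* all-ones mask */
    for (size_t i = 0; i < n; i += 2) {
        __m128i v = _mm_load_si128((const __m128i *)(z + i));
        /* x xor all-ones == not x */
        _mm_store_si128((__m128i *)(z + i), _mm_xor_si128(v, ones));
    }
#else
    /* Scalar fallback when SSE2 is not enabled. */
    for (size_t i = 0; i < n; i++)
        z[i] = ~z[i];
#endif
}
```

Unlike the assembly, this sketch does not control unrolling or instruction ordering, so it makes no claim about reaching 0.50 cycles/word.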
