Intel64 Multiprecision Arithmetic
Eric Bainville - Dec 2006Binary OP
This function combines two vectors (X,N) and (Z,N) using binary operator op (add, sub, and, or, xor). The result is put back in (Z,N).
Obviously, the timing can't be better than Memory Copy. The following code can get us quite close to it:
shr N, 2
align 16
.a:
movdqa xmm0, [Z]
movdqa xmm1, [X]
lea X, [X + 32]
OP xmm0, xmm1
movdqa [Z], xmm0
movdqa xmm0, [Z + 16]
movdqa xmm1, [X - 16]
lea Z, [Z + 32]
OP xmm0, xmm1
movdqa [Z - 16], xmm0
dec N
jnz .a
One iteration requires 6.50 cycles, so we are running at 1.64 cycles/word. OP is an SSE2 128-bit integer instruction: pand, por, pxor, paddq, psubq, pandn.
The case of addition/subtraction with carry is different for several reasons:
- There is no SSE2 intruction handling the carry propagation.
- The carry propagation implementation (adc, sbb) on Intel processors is really slow.
- Even worse: we need to explicitely save/restore the carry between loop iterations, even when using lea and dec to update counters (these instructions don't touch the C flag).
Jason W. Martin proposes a pattern similar to the following, processing 4 words per iteration:
shr N, 2
; Initial carry bit in RAX
xor rax, rax
align 16
.a:
bt rax, 0
mov AUX0, [Z]
mov AUX1, [X]
OP AUX0, AUX1
mov [Z], AUX0
mov AUX0, [Z + 8]
mov AUX1, [X + 8]
OP AUX0, AUX1
mov [Z + 8], AUX0
mov AUX0, [Z + 16]
mov AUX1, [X + 16]
OP AUX0, AUX1
mov [Z + 16], AUX0
mov AUX0, [Z + 24]
mov AUX1, [X + 24]
OP AUX0, AUX1
mov [Z + 24], AUX0
setc al
lea X, [X + 32]
lea Z, [Z + 32]
dec N
jnz .a
; Final carry bit in RAX
It runs at 2.63 cycles/word. The main speed gain comes from the bt and setc instructions, used to restore and save the carry flag from one iteration to the next. By further unrolling the loop to 8 words per iteration, we can reach 2.53 cycles/word.
| Intel64 Multiprecision : Memory Copy | Top of Page | POV-Ray Buttons/Logos |
