Intel64 Multiprecision ArithmeticEric Bainville - Dec 2006
This function combines two vectors (X,N) and (Z,N) using binary operator op (add, sub, and, or, xor). The result is put back in (Z,N).
Obviously, the timing can't be better than Memory Copy. The following code can get us quite close to it:
shr N, 2 align 16 .a: movdqa xmm0, [Z] movdqa xmm1, [X] lea X, [X + 32] OP xmm0, xmm1 movdqa [Z], xmm0 movdqa xmm0, [Z + 16] movdqa xmm1, [X - 16] lea Z, [Z + 32] OP xmm0, xmm1 movdqa [Z - 16], xmm0 dec N jnz .a
One iteration requires 6.50 cycles, so we are running at 1.64 cycles/word. OP is an SSE2 128-bit integer instruction: pand, por, pxor, paddq, psubq, pandn.
The case of addition/subtraction with carry is different for several reasons:
- There is no SSE2 intruction handling the carry propagation.
- The carry propagation implementation (adc, sbb) on Intel processors is really slow.
- Even worse: we need to explicitely save/restore the carry between loop iterations, even when using lea and dec to update counters (these instructions don't touch the C flag).
Jason W. Martin proposes a pattern similar to the following, processing 4 words per iteration:
shr N, 2 ; Initial carry bit in RAX xor rax, rax align 16 .a: bt rax, 0 mov AUX0, [Z] mov AUX1, [X] OP AUX0, AUX1 mov [Z], AUX0 mov AUX0, [Z + 8] mov AUX1, [X + 8] OP AUX0, AUX1 mov [Z + 8], AUX0 mov AUX0, [Z + 16] mov AUX1, [X + 16] OP AUX0, AUX1 mov [Z + 16], AUX0 mov AUX0, [Z + 24] mov AUX1, [X + 24] OP AUX0, AUX1 mov [Z + 24], AUX0 setc al lea X, [X + 32] lea Z, [Z + 32] dec N jnz .a ; Final carry bit in RAX
It runs at 2.63 cycles/word. The main speed gain comes from the bt and setc instructions, used to restore and save the carry flag from one iteration to the next. By further unrolling the loop to 8 words per iteration, we can reach 2.53 cycles/word.
|Intel64 Multiprecision : Memory Copy||Top of Page||POV-Ray Buttons/Logos|