AMD64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
Binary OP NOT
The next function will combine two vectors (X,N) and (Z,N) using binary operator op (and, or, xor), with the X operand bits inverted. The result is put back in (Z,N): Zi ⇐ Zi op not Xi.
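As a point of comparison for the optimized assembly, the operation can be written as a plain C reference model. This is only a sketch: the names `binop_t` and `binop_not` are hypothetical and do not come from the original code, and nothing here attempts to reproduce the scheduling discussed in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, introduced only for this illustration. */
typedef enum { BINOP_AND, BINOP_OR, BINOP_XOR } binop_t;

/* Reference model: z[i] = z[i] op (not x[i]) for every i in [0, n). */
void binop_not(uint64_t *z, const uint64_t *x, size_t n, binop_t op)
{
    assert(n == 0 || (z != NULL && x != NULL));
    for (size_t i = 0; i < n; i++) {
        /* Inverting a word is the same as xor-ing it with all 1's. */
        uint64_t nx = x[i] ^ UINT64_MAX;
        switch (op) {
        case BINOP_AND: z[i] &= nx; break;
        case BINOP_OR:  z[i] |= nx; break;
        case BINOP_XOR: z[i] ^= nx; break;
        }
    }
}
```

The xor with an all-ones word is the same trick the assembly uses to invert X with a memory operand.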
The memory requirements are the same as for the previous function. One additional execution slot is needed to invert each word, giving 4P+3 slots for one iteration processing P words. We are still limited by memory accesses, but need to process 8 words per iteration to reach that limit. Since the processor had a hard time scheduling the code, I replaced the "mov memory then not register" sequence with "mov register then xor memory", which has a smaller latency and therefore makes scheduling easier.
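The slot count above can be checked with a small back-of-the-envelope model. The sketch below assumes 3 execution slots and 2 memory accesses per cycle; these two machine parameters are inferred from the figures quoted in this section rather than stated in it, and all function names are hypothetical.

```c
#include <assert.h>

/* Assumed machine model (inferred, not stated in the text):
   3 execution slots and 2 memory accesses per cycle. */
enum { SLOTS_PER_CYCLE = 3, MEM_OPS_PER_CYCLE = 2 };

/* Execution slots for one iteration over p words: 4 per word plus 3 of loop overhead. */
int exec_slots(int p) { assert(p > 0); return 4 * p + 3; }

/* Memory accesses per iteration: read X, read Z, write Z for each word. */
int mem_ops(int p) { assert(p > 0); return 3 * p; }

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Lower bound on cycles per iteration: the slower of the two resources. */
int min_cycles(int p)
{
    int by_exec = ceil_div(exec_slots(p), SLOTS_PER_CYCLE);
    int by_mem  = ceil_div(mem_ops(p), MEM_OPS_PER_CYCLE);
    return by_exec > by_mem ? by_exec : by_mem;
}

/* Slots left idle if the iteration runs in exactly min_cycles(p) cycles. */
int idle_slots(int p) { return min_cycles(p) * SLOTS_PER_CYCLE - exec_slots(p); }
```

Under these assumptions, P = 8 gives 35 slots and 24 memory accesses, so both resources allow 12 cycles per iteration with a single idle slot. With P = 4 the 19 execution slots alone need 7 cycles, or 1.75 cycles/word.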
8 words per iteration
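shr N, 3 ; N = iteration count (8 words each); N must be a nonzero multiple of 8
xor FULL, FULL
dec FULL ; FULL is all 1's
mov AUX0, FULL ; preload all 1's so that xor with [X] yields not X
mov AUX1, FULL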
align 16
.a:
xor AUX0, [X ]
xor AUX1, [X + 8]
lea X, [X + 64]
OP AUX0, [Z ]
OP AUX1, [Z + 8]
mov AUX2, FULL
mov [Z ], AUX0
mov [Z + 8], AUX1
mov AUX3, FULL
xor AUX2, [X - 48]
xor AUX3, [X - 40]
mov AUX0, FULL
OP AUX2, [Z + 16]
OP AUX3, [Z + 24]
mov AUX1, FULL
mov [Z + 16], AUX2
mov [Z + 24], AUX3
lea Z, [Z + 64]
xor AUX0, [X - 32]
xor AUX1, [X - 24]
mov AUX2, FULL
OP AUX0, [Z - 32]
OP AUX1, [Z - 24]
mov AUX3, FULL
mov [Z - 32], AUX0
mov [Z - 24], AUX1
mov AUX0, FULL
xor AUX2, [X - 16]
xor AUX3, [X - 8]
mov AUX1, FULL
OP AUX2, [Z - 16]
OP AUX3, [Z - 8]
dec N
mov [Z - 16], AUX2
mov [Z - 8], AUX3
jnz .a
This code runs at 12 cycles/iteration, or 1.50 cycles/word. There is only one unused execution slot in each iteration!
