AMD64 Multiprecision ArithmeticEric Bainville - Dec 2006
This function multiplies the number represented by vector (Z,N) by a constant K.
Processing each word requires at least 6 execution slots: loading RAX (1 slot), mul (2 slots), unloading/accumulating RAX and RDX (2 slots), storing the result (1 slot). At least one additional slot is required to handle correctly the carry propagation. Consequently, we can expect timings near 3.00 cycles/word. Actually, 3.00 itself is quite hard to reach. It can be done with the following pattern, processing two words per iteration:
shr N, 1 xor AUX0, AUX0 ; AUX0: input carry in 0..K-1 align 16 .a: mov RAX, [Z] mul K xor AUX1, AUX1 lea Z, [Z + 16] add AUX0, RAX adc AUX1, RDX mov [Z - 16], AUX0 mov RAX, [Z - 8] mul K xor AUX0, AUX0 add AUX1, RAX adc AUX0, RDX mov [Z - 8], AUX1 dec N jnz .a ; AUX0: output carry in 0..K-1
It runs in 6 cycles/iteration, or 3.00 cycles/word. All 14 instructions occupy 16 slots in the integer execution units, and are scheduled in 18 slots (6 cycles).
The input carry can be optionally passed as argument, and in this case must be in 0..K-1. The output carry can be returned too, and will be in the same range.
I could not find a pattern under 3.00 cycles/word, by unrolling the loop further, or "pipelining" the pattern.
|AMD64 Multiprecision : Binary OP NOT||Top of Page||AMD64 Multiprecision : Multiply and accumulate|