AMD64 Multiprecision Arithmetic
Eric Bainville - Dec 2006Left and right shifts
These functions left/right shift the 64N-bit long vector (Z,N) by K=0..63 bits.
Shifting by only one bit can be done in 1.50 cycle/word using the Binary OP pattern. Longer shifts require more operations. Here is a first version using only general purpose registers:
xor CARRY, CARRY
; CARRY is the input "carry" (k lsb bits), can be passed as argument
mov rcx, K
mov MASK, 1
shl MASK, cl
dec MASK ; Has k ones as low significant bits
shr N, 1
align 16
.a:
mov AUX1, [Z]
mov AUX2, [Z + 8]
mov AUX3, MASK
mov AUX4, MASK
rol AUX1, cl
rol AUX2, cl
and AUX3, AUX1
and AUX4, AUX2
xor AUX1, CARRY
xor AUX2, AUX4
xor AUX1, AUX3
xor AUX2, AUX3
mov CARRY, AUX4
mov [Z], AUX1
mov [Z + 8], AUX2
lea Z, [Z + 16]
dec N
jnz .a
; CARRY is the output carry, can be returned
This code runs in 6 cycles/iteration, or 3.00 cycle/word. Here again the processor does a good scheduling job, because we have 17 slots among the 18 slots available in 6 cycles. This code first rotates the words to put bits in place, then moves the K least significant bits using binary operations. Another approach using two shifts for each word requires the same running time.
gmp 4.2.1 code runs at 2.50 cycles/word, making use of the 64-bit multimedia extensions (MMX). The gmp code is similar to the following:
movd mm6, K ; mm6 is k
mov AUX, K
xor AUX, 63
inc AUX
movd mm7, AUX ; mm7 is 64-k
pxor mm2, mm2 ; Input carry in mm2
shr N, 1
align 16
.a:
movq mm0, [Z]
movq mm1, [Z]
psllq mm0, mm6
psrlq mm1, mm7
por mm0, mm2
movq [Z], mm0
movq mm0, [Z + 8]
movq mm2, [Z + 8]
psllq mm0, mm6
psrlq mm2, mm7
por mm0, mm1
movq [Z + 8], mm0
lea Z, [Z + 16]
dec N
jnz .a
; Output carry in mm2
emms
In this loop we have 3 kinds of instructions: 2 memory read movq requiring one slot on any of the three (FADD, FMUL, FMISC) floating point execution units, 2 memory write movq requiring one slot on the FMISC unit, and the other 2 movq, 2 psllq, 2 psrlq, 2 por instructions requiring one slot on the FADD or FMUL units.
It appears than each word processing block requires 2 cycles, and there is an extra cycle for the lea and dec. Actually, unrolling the loop to 4 words per iteration gives a 2.25 cycles/word code:
movd mm6, K ; mm6 is k
mov AUX, K
xor AUX, 63
inc AUX
movd mm7, AUX ; mm7 is 64-k
pxor mm2, mm2 ; Input carry in mm2
shr N, 2
align 16
.a:
movq mm0, [Z]
movq mm1, [Z]
psllq mm0, mm6
psrlq mm1, mm7
por mm0, mm2
movq [Z], mm0
movq mm0, [Z + 8]
movq mm2, [Z + 8]
psllq mm0, mm6
psrlq mm2, mm7
por mm0, mm1
movq [Z + 8], mm0
movq mm0, [Z + 16]
movq mm1, [Z + 16]
psllq mm0, mm6
psrlq mm1, mm7
por mm0, mm2
movq [Z + 16], mm0
movq mm0, [Z + 24]
movq mm2, [Z + 24]
psllq mm0, mm6
psrlq mm2, mm7
por mm0, mm1
movq [Z + 24], mm0
lea Z, [Z + 32]
dec N
jnz .a
; Output carry in mm2
emms
However, further unrolling to 8 words per iteration goes back to 2.50 cycles/word.
Right shift is very similar, with the difference that access to Z must be reversed.
| AMD64 Multiprecision : Multiply and accumulate | Top of Page | AMD64 Multiprecision : Division by a scalar |
