AMD64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
Memory Copy
This function copies vector (X,N) to (Z,N). The vectors shall not overlap.
The AMD64 Optimization Manual (section 5.13) provides the following code:
4 words per iteration
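shr N, 2                ; N = iteration count (4 words each); assumes N is a nonzero multiple of 4
align 16
.a:
mov AUX0, [X]           ; load words 0 and 1
mov AUX1, [X + 8]
lea X, [X + 32]         ; advance source by 4 words (lea leaves the flags untouched)
mov [Z], AUX0           ; store words 0 and 1
mov [Z + 8], AUX1
lea Z, [Z + 32]         ; advance destination by 4 words
mov AUX0, [X - 16]      ; load words 2 and 3 (X has already been advanced)
mov AUX1, [X - 8]
dec N                   ; sets ZF; the mov instructions below do not alter it
mov [Z - 16], AUX0      ; store words 2 and 3
mov [Z - 8], AUX1
jnz .a                  ; loop until N reaches zero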
It runs in 6 cycles/iteration, or 1.50 cycles/word. Each iteration performs 4 reads and 4 writes, so with the limit of two memory reads/writes per cycle, the optimal speed would be 4 cycles/iteration, or 1.00 cycle/word. Despite a lot of experiments, I could not find a faster sequence...
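For readers who want to check the data movement without an assembler, the loop can be mirrored in C. This is only a sketch of the same access pattern: the name copy_words and its signature are hypothetical, and a compiler is free to schedule the loads and stores differently, so it says nothing about the cycle counts quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of the assembly loop: copies n 64-bit words
   from x to z, four words per iteration. The vectors must not overlap,
   which is what the restrict qualifiers express. */
static void copy_words(uint64_t *restrict z, const uint64_t *restrict x, size_t n)
{
    /* Like the assembly loop, this assumes n is a nonzero multiple of 4. */
    assert(n > 0 && n % 4 == 0);

    size_t iterations = n >> 2;          /* shr N, 2 */
    do {
        uint64_t aux0 = x[0];            /* load words 0 and 1 */
        uint64_t aux1 = x[1];
        z[0] = aux0;                     /* store words 0 and 1 */
        z[1] = aux1;
        aux0 = x[2];                     /* load words 2 and 3 */
        aux1 = x[3];
        z[2] = aux0;                     /* store words 2 and 3 */
        z[3] = aux1;
        x += 4;                          /* advance both pointers by 4 words */
        z += 4;
    } while (--iterations != 0);         /* dec N / jnz .a */
}
```

Calling copy_words(dst, src, 8) on two separate 8-word arrays runs the loop body twice and leaves dst equal to src.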
