Intel64 Multiprecision Arithmetic
Eric Bainville - Dec 2006
Memory Zero
This function sets all n 64-bit words of a vector (Z,n) to 0.
Conroe
The Core 2 Duo architecture can write one 128-bit word per clock cycle (the Athlon 64 could write two 64-bit words per cycle). Doing this requires using a 128-bit XMM register, as in the following code:
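shr N, 1            ; N = number of 128-bit stores (n/2; n must be even and nonzero)
pxor xmm0, xmm0     ; xmm0 = 0
align 16
.a:
movdqa [Z], xmm0    ; aligned 128-bit store: Z must be 16-byte aligned
lea Z, [Z + 16]     ; advance Z by 16 bytes without modifying the flags
dec N
jnz .a              ; loop until N reaches 0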
The lea and dec both execute in the same cycle as the movdqa, so the loop runs at 1 cycle/iteration. Each iteration clears two 64-bit words, which gives 0.50 cycle/word.
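For readers who prefer C, the same loop can be sketched with SSE2 intrinsics. This is only an illustration, not the code measured above: the function name zero_words_sse2 is hypothetical, the scalar store for an odd trailing word is an addition the assembly does not have, and a compiler is free to generate a different instruction sequence than the hand-written loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_setzero_si128, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: set the n 64-bit words starting at z to 0.
   Assumption: z is 16-byte aligned, as movdqa requires. */
static void zero_words_sse2(uint64_t *z, size_t n)
{
    const __m128i zero = _mm_setzero_si128();  /* pxor xmm0, xmm0 */
    size_t pairs = n >> 1;                     /* shr N, 1 */

    while (pairs--) {
        _mm_store_si128((__m128i *)z, zero);   /* movdqa [Z], xmm0 */
        z += 2;                                /* lea Z, [Z + 16] */
    }
    if (n & 1) {
        *z = 0;  /* odd trailing word: not handled by the assembly loop */
    }
}
```

Unlike the assembly loop, this sketch accepts n = 0 and odd n; the alignment requirement on z is unchanged.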
Bloomfield
The above code runs slower on a Core i7. We have to unroll the loop to two stores per iteration to reach 0.50 cycle/word:
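shr N, 2            ; N = number of iterations (n/4; n must be a nonzero multiple of 4)
pxor xmm0, xmm0     ; xmm0 = 0
align 16
.a:
movdqa [Z], xmm0    ; two aligned 128-bit stores per iteration
movdqa [Z+16], xmm0
add Z, 32           ; advance Z by 32 bytes
sub N, 1
jnz .a              ; loop until N reaches 0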
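The unrolled loop can be sketched in C with SSE2 intrinsics as follows. The function name zero_words_sse2_x2 is hypothetical, and the scalar tail for the last n mod 4 words is an addition that the assembly does not have. The cycle counts quoted above apply to the hand-written assembly; a compiler may unroll or schedule this C version differently.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_setzero_si128, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: set the n 64-bit words starting at z to 0,
   using two 128-bit stores per loop iteration.
   Assumption: z is 16-byte aligned, as movdqa requires. */
static void zero_words_sse2_x2(uint64_t *z, size_t n)
{
    const __m128i zero = _mm_setzero_si128();      /* pxor xmm0, xmm0 */
    size_t blocks = n >> 2;                        /* shr N, 2 */
    size_t tail = n & 3;                           /* words left after the main loop */

    while (blocks--) {
        _mm_store_si128((__m128i *)z, zero);       /* movdqa [Z], xmm0 */
        _mm_store_si128((__m128i *)(z + 2), zero); /* movdqa [Z+16], xmm0 */
        z += 4;                                    /* add Z, 32 */
    }
    while (tail--) {
        *z++ = 0;  /* scalar tail: not handled by the assembly loop */
    }
}
```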
