Intel64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Memory Zero

This function sets to 0 all words of a vector (Z,n).

Conroe

The Core 2 Duo architecture can write one 128-bit word per clock cycle (the Athlon 64 could write two 64-bit words per cycle). Doing this requires a 128-bit XMM register, as in the following code, which assumes that Z is 16-byte aligned (required by movdqa) and that N is even and nonzero:

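        shr     N, 1            ; N = number of 128-bit stores (2 words each)
        pxor    xmm0, xmm0      ; xmm0 = 0
        align   16
.a:
        movdqa  [Z], xmm0       ; aligned 128-bit store of zero
        lea     Z, [Z + 16]     ; advance Z without modifying the flags
        dec     N
        jnz     .a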

The lea and dec both execute in the same cycle as the movdqa, so the loop runs at 1 cycle/iteration. Each iteration writes two 64-bit words, giving 0.50 cycle/word.
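For readers who prefer to experiment from C, the same loop can be sketched with SSE2 intrinsics. This is only an illustration, not the code that was measured: the function name zero_words_sse2 is hypothetical, the preconditions (Z aligned on 16 bytes, N even) are the ones the assembly loop relies on, and a plain scalar loop is used when SSE2 is not available. A compiler is free to schedule or rewrite this loop differently, so the cycle counts above are not guaranteed to carry over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: set the n 64-bit words at z to 0.
   Assumes z is 16-byte aligned and n is even, like the assembly loop. */
void zero_words_sse2(uint64_t *z, size_t n)
{
    assert(((uintptr_t)z & 15) == 0);   /* aligned store would fault otherwise */
    assert((n & 1) == 0);               /* two words per 128-bit store */
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();        /* pxor xmm0, xmm0 */
    for (size_t i = 0; i < n; i += 2)
        _mm_store_si128((__m128i *)(z + i), zero);   /* movdqa [Z], xmm0 */
#else
    /* Portable fallback when SSE2 is not enabled */
    for (size_t i = 0; i < n; i++)
        z[i] = 0;
#endif
}
```

Unlike the assembly version, this sketch also accepts n = 0, because the for loop tests its condition before the first store.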

Bloomfield

The above code runs slower on a Core i7. We have to unroll the loop to two stores (four words) per iteration to reach 0.50 cycle/word; N must then be a nonzero multiple of 4:

        shr     N, 2            ; N = number of iterations (4 words each)
        pxor    xmm0, xmm0      ; xmm0 = 0
        align   16
.a:
        movdqa  [Z], xmm0       ; words 0 and 1
        movdqa  [Z+16], xmm0    ; words 2 and 3
        add     Z, 32
        sub     N, 1
        jnz     .a
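The unrolled loop only covers vectors whose length is a multiple of 4 words and whose address is aligned on 16 bytes. One possible way to lift these restrictions is sketched below in C with SSE2 intrinsics: a scalar store aligns the pointer, the unrolled loop handles groups of 4 words, and a scalar tail finishes the remaining 0 to 3 words. The function name zero_words_any is hypothetical, the sketch assumes Z is at least 8-byte aligned, and it has not been timed, so nothing is claimed here about its cycle count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: set the n 64-bit words at z to 0, for any n.
   Assumes z is 8-byte aligned (the natural alignment of a 64-bit word). */
void zero_words_any(uint64_t *z, size_t n)
{
    assert(((uintptr_t)z & 7) == 0);

    /* Head: one scalar store brings z to a 16-byte boundary */
    if (n > 0 && ((uintptr_t)z & 15) != 0) {
        *z++ = 0;
        n--;
    }
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    /* Main loop: two aligned 128-bit stores, i.e. 4 words per iteration */
    for (; n >= 4; n -= 4, z += 4) {
        _mm_store_si128((__m128i *)z, zero);
        _mm_store_si128((__m128i *)(z + 2), zero);
    }
#endif
    /* Tail: remaining 0..3 words (or all words when SSE2 is not enabled) */
    while (n > 0) {
        *z++ = 0;
        n--;
    }
}
```

For long vectors the head and tail cost at most 4 scalar stores in total, so the main loop dominates the running time.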