Intel64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Unary OP

This function applies a unary operator OP (neg or not), in place, to every word of a vector (Z,n).
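As a point of reference, the intended effect can be sketched in portable C. This is only an illustration of the semantics, not the author's code: the names vec_not and vec_neg are hypothetical, and the sketch assumes 64-bit words with OP applied to each word independently (no carry between words), as the description above suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routines; the names are illustrative only. */

/* OP = not: complement every 64-bit word of (z, n) in place. */
static void vec_not(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        z[i] = ~z[i];
}

/* OP = neg: two's-complement negate every word independently.
   No carry is propagated from one word to the next. */
static void vec_neg(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        z[i] = 0 - z[i];
}
```

The assembly versions below compute the same result; only the instruction scheduling differs.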

Conroe

Using 64-bit general purpose registers, we can reach the maximum 64-bit memory bandwidth after unrolling the trivial load+op+store loop to 4 words/iteration:

        shr     N, 2            ; N = number of 4-word iterations
        align   16
.a:
        mov     AUX, [Z]
        OP      AUX
        mov     [Z], AUX
        mov     AUX, [Z + 8]
        OP      AUX
        mov     [Z + 8], AUX

        lea     Z, [Z + 32]     ; advance Z by 4 words (lea leaves the flags untouched)

        mov     AUX, [Z - 16]
        OP      AUX
        mov     [Z - 16], AUX
        mov     AUX, [Z - 8]
        OP      AUX
        mov     [Z - 8], AUX

        dec     N
        jnz     .a

This runs in 4 cycles/iteration, or 1.00 cycle/word. This can be reduced further using 128-bit XMM registers: each load or store then moves two words, so, independently of other limiting factors, the memory bandwidth corresponds to 0.50 cycles/word.
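As shown, the scalar loop fragment shifts N right by 2 and then counts down with dec/jnz, so it expects N to be a nonzero multiple of 4: with N < 4 the counter would already be 0 before the first dec. The C sketch below mirrors the 4-word unrolling and adds a tail loop for the leftover words. It is an assumption about how a complete routine might handle this, not something taken from the article, and the name vec_neg_unrolled is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustrative name): word-wise neg, 4 words per
   iteration like the scalar loop, plus a tail loop for n mod 4 words. */
static void vec_neg_unrolled(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);

    size_t blocks = n >> 2;     /* same as "shr N, 2" */
    size_t tail   = n & 3;      /* 0..3 words left over */

    while (blocks != 0) {       /* skipped entirely when n < 4 */
        z[0] = 0 - z[0];
        z[1] = 0 - z[1];
        z[2] = 0 - z[2];
        z[3] = 0 - z[3];
        z += 4;                 /* 32 bytes, as in "lea Z, [Z + 32]" */
        blocks--;
    }
    while (tail != 0) {         /* remaining words, one at a time */
        *z = 0 - *z;
        z++;
        tail--;
    }
}
```

Testing the block count before entering the loop is the design choice that makes small n safe; the tail costs at most three extra word operations.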

The following code is an optimal version of the SSE2 variant for the case OP=not, running at 0.50 cycles/word:

        ; Load 0xFFFFFFFF.FFFFFFFF.FFFFFFFF.FFFFFFFF to xmm4
        xor     AUX, AUX
        not     AUX
        movd    xmm4, AUX
        punpcklqdq  xmm4, xmm4
        shr     N, 3            ; N = number of 8-word (64-byte) iterations
        movdqa  xmm0, xmm4
        align   16
.a:
        pxor    xmm0, [Z     ]
        movdqa  xmm1, xmm4
        movdqa  [Z     ], xmm0

        pxor    xmm1, [Z + 16]
        movdqa  xmm0, xmm4
        movdqa  [Z + 16], xmm1

        pxor    xmm0, [Z + 32]
        movdqa  xmm1, xmm4
        movdqa  [Z + 32], xmm0

        lea     Z, [Z + 64]

        pxor    xmm1, [Z - 16]
        movdqa  xmm0, xmm4
        movdqa  [Z - 16], xmm1

        dec     N
        jnz     .a
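
The idea behind the SSE2 loop, computing not as an XOR with an all-ones constant, 128 bits at a time, can also be sketched with compiler intrinsics. This is a minimal illustration under stated assumptions, not a replacement for the scheduled assembly: the name vec_not_sse2 is hypothetical, the sketch makes no claim about reaching 0.50 cycles/word, and it uses unaligned loads and stores, whereas movdqa and pxor with a memory operand in the listing above require Z to be 16-byte aligned. A scalar loop handles leftover words, and also the whole vector when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (illustrative name): NOT via XOR with all ones. */
static void vec_not_sse2(uint64_t *z, size_t n)
{
    assert(z != NULL || n == 0);
    size_t i = 0;
#if defined(__SSE2__)
    /* All-ones constant, playing the role of xmm4 in the listing. */
    const __m128i ones = _mm_set1_epi32(-1);
    for (; i + 2 <= n; i += 2) {
        /* Unaligned access: an assumption made here so the sketch
           works for any pointer; the assembly uses aligned moves. */
        __m128i v = _mm_loadu_si128((const __m128i *)(z + i));
        v = _mm_xor_si128(v, ones);
        _mm_storeu_si128((__m128i *)(z + i), v);
    }
#endif
    /* Scalar tail, or the whole vector without SSE2. */
    for (; i < n; i++)
        z[i] = ~z[i];
}
```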