AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Unary OP

This function applies unary operator OP (neg, not) to all words of a vector (Z,N).
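As a reference for what the assembly below computes, here is a minimal sketch in Python. It assumes 64-bit words and models the vector (Z,N) as a plain list; the name `unary_op` and the string arguments are hypothetical, chosen only for illustration.

```python
MASK64 = (1 << 64) - 1  # assumption: words are 64 bits wide, as on AMD64


def unary_op(words, op):
    """Apply 'neg' or 'not' independently to every 64-bit word of a list.

    Hypothetical helper: each word is processed on its own, with no carry
    or borrow propagated between words.
    """
    if op == "neg":
        # Two's complement negation of each word, reduced modulo 2**64.
        return [(-w) & MASK64 for w in words]
    if op == "not":
        # Bitwise complement of each word.
        return [w ^ MASK64 for w in words]
    raise ValueError("op must be 'neg' or 'not'")
```

For example, `unary_op([0, 1], "not")` returns the two words with all bits flipped, and `unary_op([1], "neg")` returns a single word with all 64 bits set.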

We apply these operators to a register. This means that each word needs a memory read, the operation itself, and a memory write: 3 slots in the execution units in total. Since 3 execution units are available each cycle, the asymptotically optimal timing is 1.00 cycle/word. We also have to update the loop counter and the address register, which takes one additional cycle per iteration. If P is the number of words processed per loop iteration, the optimal timing is therefore (P+1)/P cycles/word: 2.00 for P=1, 1.50 for P=2, 1.25 for P=4, and 1.125 for P=8.
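The estimate above can be reproduced with a few lines of Python. This is only a sketch of the simplified model in the text (3 slots per word, 3 execution units, one extra cycle of loop overhead per iteration); the function name `cycles_per_word` is hypothetical, and the model ignores latencies and other CPU limitations.

```python
def cycles_per_word(p, units=3):
    """Predicted cycles/word when p words are processed per loop iteration."""
    slots = 3 * p                # one read, one OP, one write per word
    body_cycles = slots / units  # cycles needed by the loop body
    overhead = 1                 # counter and address update, per iteration
    return (body_cycles + overhead) / p


for p in (1, 2, 4, 8):
    print(f"P={p}: {cycles_per_word(p)} cycles/word")
# prints:
# P=1: 2.0 cycles/word
# P=2: 1.5 cycles/word
# P=4: 1.25 cycles/word
# P=8: 1.125 cycles/word
```

The overhead is amortized over P words, which is why unrolling brings the figure closer to 1.00 without ever reaching it.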

Note that this estimate ignores instruction latencies and other CPU limitations. Usually, thanks to out-of-order execution and register renaming, the CPU is able to automatically schedule instructions from one loop iteration to the next.

2 words per iteration
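        shr     N, 1            ; iteration count = N/2 (N must be a nonzero multiple of 2)
        align   16
.a:
        mov     AUX, [Z    ]    ; load word 0
        OP      AUX             ; apply neg or not
        mov     [Z    ], AUX    ; store word 0
        mov     AUX, [Z + 8]    ; load word 1
        OP      AUX
        mov     [Z + 8], AUX    ; store word 1
        lea     Z, [Z + 16]     ; advance the pointer by 2 words
        dec     N               ; one iteration done
        jnz     .a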

As expected, each iteration requires 3 cycles, yielding 1.50 cycles/word. Unrolling to our fixed limit of 8 words per iteration, we get:

8 words per iteration
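        shr     N, 3            ; iteration count = N/8 (N must be a nonzero multiple of 8)
        align   16
.a:
        mov     AUX, [Z     ]   ; word 0: load, apply OP, store
        OP      AUX
        mov     [Z     ], AUX
        mov     AUX, [Z +  8]   ; word 1
        OP      AUX
        mov     [Z +  8], AUX
        mov     AUX, [Z + 16]   ; word 2
        OP      AUX
        mov     [Z + 16], AUX
        mov     AUX, [Z + 24]   ; word 3
        OP      AUX
        mov     [Z + 24], AUX
        mov     AUX, [Z + 32]   ; word 4
        OP      AUX
        mov     [Z + 32], AUX
        mov     AUX, [Z + 40]   ; word 5
        OP      AUX
        mov     [Z + 40], AUX
        mov     AUX, [Z + 48]   ; word 6
        OP      AUX
        mov     [Z + 48], AUX
        mov     AUX, [Z + 56]   ; word 7
        OP      AUX
        mov     [Z + 56], AUX
        lea     Z, [Z + 64]     ; advance the pointer by 8 words
        dec     N               ; one iteration done
        jnz     .a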

This code runs at the predicted optimal speed of 1.125 cycles/word.