AMD64 Multiprecision Arithmetic

Eric Bainville - Dec 2006

Binary OP

This function combines two vectors (X,N) and (Z,N) using binary operator op (add, sub, and, or, xor, adc, sbb). The result is put back in (Z,N).

Let's start with a simple 2 words/iteration loop, with paired memory access and operations:

2 words per iteration
	shr	N,1
        align   16
.a:
        mov	AUX0, [Z    ]
        mov	AUX1, [Z + 8]
	OP	AUX0, [X    ]
	OP	AUX1, [X + 8]
	mov	[Z    ], AUX0
	mov	[Z + 8], AUX1
        lea     X, [X + 16]
        lea     Z, [Z + 16]
        dec     N
        jnz	.a

This loop runs at 4 cycles/iteration, or 2.00 cycles/word. Each word requires two memory read, one write, and the operation is coupled with one of the memory read. For P words per iteration, we have 3P memory accesses, and 3P+3 execution slots needed (including the update of the two address registers and the loop counter).

The memory-limited timing is then 3P/2 cycles per iteration, and the slot-limited timing is P+1 cycles. Consequently, we are limited by memory bandwidth, and the optimal timing is 1.50 cycles/word. Let's unroll the loop further:

4 words per iteration
	shr	N,2
        align   16
.a:
        mov	AUX0, [X     ]
        mov	AUX1, [X +  8]
	lea	X, [X + 32]
	OP	AUX0, [Z     ]
	OP	AUX1, [Z +  8]
	mov	[Z     ], AUX0
	mov	[Z +  8], AUX1
        lea     Z, [Z + 32]
        mov	AUX0, [X - 16]
        mov	AUX1, [X -  8]
	OP	AUX0, [Z - 16]
	OP	AUX1, [Z -  8]
	dec	N
	mov	[Z - 16], AUX0
	mov	[Z -  8], AUX1
        jnz	.a

This code runs at 6 cycles/iteration, or 1.50 cycle/word. As we have seen, this is minimal: we have two memory accesses per cycle, and 15=3*P+3 execution slots used of the 18=6*3 available slots. Interestingly, the time is the same when op is adc or sbb, when a carry bit dependency is introduced (this is different on Intel processors).