AMD64 Multiprecision ArithmeticEric Bainville - Dec 2006
This function combines two vectors (X,N) and (Z,N) using binary operator op (add, sub, and, or, xor, adc, sbb). The result is put back in (Z,N).
Let's start with a simple 2 words/iteration loop, with paired memory access and operations:
2 words per iteration shr N,1 align 16 .a: mov AUX0, [Z ] mov AUX1, [Z + 8] OP AUX0, [X ] OP AUX1, [X + 8] mov [Z ], AUX0 mov [Z + 8], AUX1 lea X, [X + 16] lea Z, [Z + 16] dec N jnz .a
This loop runs at 4 cycles/iteration, or 2.00 cycles/word. Each word requires two memory read, one write, and the operation is coupled with one of the memory read. For P words per iteration, we have 3P memory accesses, and 3P+3 execution slots needed (including the update of the two address registers and the loop counter).
The memory-limited timing is then 3P/2 cycles per iteration, and the slot-limited timing is P+1 cycles. Consequently, we are limited by memory bandwidth, and the optimal timing is 1.50 cycles/word. Let's unroll the loop further:
4 words per iteration shr N,2 align 16 .a: mov AUX0, [X ] mov AUX1, [X + 8] lea X, [X + 32] OP AUX0, [Z ] OP AUX1, [Z + 8] mov [Z ], AUX0 mov [Z + 8], AUX1 lea Z, [Z + 32] mov AUX0, [X - 16] mov AUX1, [X - 8] OP AUX0, [Z - 16] OP AUX1, [Z - 8] dec N mov [Z - 16], AUX0 mov [Z - 8], AUX1 jnz .a
This code runs at 6 cycles/iteration, or 1.50 cycle/word. As we have seen, this is minimal: we have two memory accesses per cycle, and 15=3*P+3 execution slots used of the 18=6*3 available slots. Interestingly, the time is the same when op is adc or sbb, when a carry bit dependency is introduced (this is different on Intel processors).
|AMD64 Multiprecision : Memory Copy||Top of Page||AMD64 Multiprecision : Binary OP NOT|