# Intel64 Multiprecision Arithmetic

Eric Bainville - Dec 2006These pages relate my experiments on 64-bit multiprecision arithmetic
running on Core 2 Duo processors. *Sep 2009*: I updated some of these pages with Core i7 experiments,
and the latest versions of glibc and gmp.

I describe a set of basic memory, arithmetic, and logic operations on (long) vectors
of 64-bit words. The vector (A,N) is a vector of N consecutive 64-bit words A_{0}, A_{1},..., A_{N-1}.
Such a vector represents an unsigned integer A_{0}+A_{1}.2^{64}+...+A_{N-1}.2^{64(N-1)}. All integers in
the interval 0..2^{64N}-1 are representable, and have an unique representation.

To simplify the functions, I impose N to be a multiple of 8: the size of a vector is a multiple of 64 bytes (cache line). Vectors must be 8-byte aligned on the AMD64, and 16-byte aligned on the Intel64, where 128-bit SSE instructions are used.

I used information from the following sources:

- AMD Manuals,
- Intel Manuals,
**Agner Fog**'s Software Optimization Resources page, containing a lot of precious and easy to find information.

The GMP library already provides a very fast and comprehensive set of low-level multiprecision functions (mpn). I will often use it as a starting point for further optimizations, along with some optimized memory function of the GNU C Library.

On the CPU, execution time for each function is evaluated for 8KB vectors (N=1024 words), using the rdtsc instruction surrounded by two cpuid instructions to ensure all instructions are counted (because rdtsc may be executed out-of-order, but cpuid can't be). With this size we can get a very precise measure of the number of required cycles per word. Each function is called a large number of times, and we take the average number of cycles.

Here are the best timings (in CPU cycles per 64-bit word) I could get on the Core 2 Duo (Conroe):

Function | glibc | gmp | this page |

Z ← 0 | 1.00 | - | 0.50 |

Z ← op Z (not) | - | - | 0.50 |

Z ← op Z (neg) | - | - | 1.00 |

Z ← X | 1.75 | - | 1.50 |

Z ← Z op X (and or xor) | - | - | 1.63 |

Z ← Z op X (+ -) | - | 2.63 | 2.53 |

Z ← Z op not X (and or xor) | - | - | - |

Z ← Z * k | - | 4.00 | - |

Z ← Z * k op X (+ -) | - | 4.54 | - |

Z ← Z << k | - | 2.40 | - |

Z ← Z / k | - | 33.29 | - |

glibc refers to glibc 2.4, and gmp refers to gmp 4.2.1 with Jason W. Martin's patch.

*Sep 2009:* Here are the timings on a Core i7 920 (Bloomfield):

Function | glibc | gmp | this page |

Z ← 0 | 0.50 | - | 0.50 |

Z ← op Z (not) | - | - | - |

Z ← op Z (neg) | - | - | - |

Z ← X | 1.00 | - | 1.00 |

Z ← Z op X (and or xor) | - | - | - |

Z ← Z op X (+ -) | - | 2.25 | - |

Z ← Z op not X (and or xor) | - | - | - |

Z ← Z * k | - | 3.90 | - |

Z ← Z * k op X (+ -) | - | 5.00 | - |

Z ← Z << k | - | 1.78 | - |

Z ← Z / k | - | 20.94 | - |

glibc refers to glibc 2.9, and gmp refers to gmp 4.3.1. Hyperthreading is enabled in the processor (4 cores, 8 threads).

AMD64 Multiprecision | Top of Page | Intel64 Multiprecision : Memory Zero |