Intel64 Multiprecision ArithmeticEric Bainville - Dec 2006
These pages relate my experiments on 64-bit multiprecision arithmetic running on Core 2 Duo processors. Sep 2009: I updated some of these pages with Core i7 experiments, and the latest versions of glibc and gmp.
I describe a set of basic memory, arithmetic, and logic operations on (long) vectors of 64-bit words. The vector (A,N) is a vector of N consecutive 64-bit words A0, A1,..., AN-1. Such a vector represents an unsigned integer A0+A1.264+...+AN-1.264(N-1). All integers in the interval 0..264N-1 are representable, and have an unique representation.
To simplify the functions, I impose N to be a multiple of 8: the size of a vector is a multiple of 64 bytes (cache line). Vectors must be 8-byte aligned on the AMD64, and 16-byte aligned on the Intel64, where 128-bit SSE instructions are used.
I used information from the following sources:
- AMD Manuals,
- Intel Manuals,
- Agner Fog's Software Optimization Resources page, containing a lot of precious and easy to find information.
The GMP library already provides a very fast and comprehensive set of low-level multiprecision functions (mpn). I will often use it as a starting point for further optimizations, along with some optimized memory function of the GNU C Library.
On the CPU, execution time for each function is evaluated for 8KB vectors (N=1024 words), using the rdtsc instruction surrounded by two cpuid instructions to ensure all instructions are counted (because rdtsc may be executed out-of-order, but cpuid can't be). With this size we can get a very precise measure of the number of required cycles per word. Each function is called a large number of times, and we take the average number of cycles.
Here are the best timings (in CPU cycles per 64-bit word) I could get on the Core 2 Duo (Conroe):
|Function||glibc||gmp||this page||Z ← 0||1.00||-||0.50||Z ← op Z (not)||-||-||0.50||Z ← op Z (neg)||-||-||1.00||Z ← X||1.75||-||1.50||Z ← Z op X (and or xor)||-||-||1.63||Z ← Z op X (+ -)||-||2.63||2.53||Z ← Z op not X (and or xor)||-||-||-||Z ← Z * k||-||4.00||-||Z ← Z * k op X (+ -)||-||4.54||-||Z ← Z << k||-||2.40||-||Z ← Z / k||-||33.29||-|
glibc refers to glibc 2.4, and gmp refers to gmp 4.2.1 with Jason W. Martin's patch.
Sep 2009: Here are the timings on a Core i7 920 (Bloomfield):
|Function||glibc||gmp||this page||Z ← 0||0.50||-||0.50||Z ← op Z (not)||-||-||-||Z ← op Z (neg)||-||-||-||Z ← X||1.00||-||1.00||Z ← Z op X (and or xor)||-||-||-||Z ← Z op X (+ -)||-||2.25||-||Z ← Z op not X (and or xor)||-||-||-||Z ← Z * k||-||3.90||-||Z ← Z * k op X (+ -)||-||5.00||-||Z ← Z << k||-||1.78||-||Z ← Z / k||-||20.94||-|
glibc refers to glibc 2.9, and gmp refers to gmp 4.3.1. Hyperthreading is enabled in the processor (4 cores, 8 threads).
|AMD64 Multiprecision||Top of Page||Intel64 Multiprecision : Memory Zero|