OpenCL Multiprecision
Eric Bainville - Jan 2010

Some multiplication algorithms
Introduction
The product of two large numbers is used in virtually all higher-level operations, so its performance is crucial. As we have seen in the previous pages, the CPU remains faster for numbers of up to one million 64-bit digits, so we must implement the asymptotically fast methods based on the FFT. The best optimized CPU multiplication code is probably in the MPIR and GMP libraries. It uses a mix of the Karatsuba, Toom-Cook, and Schönhage-Strassen algorithms depending on the size of the inputs.
Given a number p stored on N digits in base B: p=Σp_{i}*B^{i}, with i in 0..N-1, and p_{i} in -M..M. We associate to p the polynomial of degree N-1: P(X)=Σp_{i}*X^{i}, i=0..N-1. Then we have p=P(B).
The product of two numbers p and q is then computed by first multiplying the corresponding polynomials, R(X)=P(X)*Q(X), and then normalizing the coefficients of R(B), i.e. propagating carries in base B. For two numbers of N digits with max digit value M, the max value of the central coefficient (degree N-1) of R(X) is N*M^{2}. R has 2N-1 coefficients (degree 2N-2), and the product p*q has 2N digits when M≤B-1.
Example with B=100 and N=4
In base B=100, the numbers p=75978566 and q=15439875 are each represented with N=4 digits: p_{0..3}=(66,85,97,75) and q_{0..3}=(75,98,43,15). The corresponding polynomials are:
P(X) = 75*X^{3} + 97*X^{2} + 85*X + 66
Q(X) = 15*X^{3} + 43*X^{2} + 98*X + 75
To compute p*q using polynomials, we first compute R(X)=P(X)*Q(X):
R(X)=1125*X^{6}+4680*X^{5}+12796*X^{4}+19776*X^{3}+18443*X^{2}+12843*X+4950
And then normalize the coefficients from degree 0 to 6, propagating the "carry" in base B=100:
Each coefficient is written in base B=100 and shifted left by its degree; summing the columns (with carries) gives the digits of the product:

degree 0:                    49 50
degree 1:              01 28 43
degree 2:           01 84 43
degree 3:        01 97 76
degree 4:     01 27 96
degree 5:     46 80
degree 6:  11 25
           -----------------------
p*q     =  11 73 09 95 61 71 92 50
Finally we obtain the expected result: p*q=1173099561719250.
Representations of a polynomial
All usual methods to compute the product of two polynomials efficiently are based on the following fact: a polynomial of degree N-1 (i.e. with N coefficients) is uniquely determined by its values at N distinct points.
In other terms, we can equivalently represent a polynomial P of degree N-1 by the sequence of its N coefficients (p_{0},p_{1},...,p_{N-1}) or by the sequence of the N values (P(x_{0}),P(x_{1}),...,P(x_{N-1})) it takes at a set of distinct points x_{i}.
This representation is interesting to compute the product, because (P*Q)(x_{i})=P(x_{i})*Q(x_{i}): the value of the product at point x_{i} is the product of the two values at x_{i}.
Note that in this case, because the product has 2N-1 coefficients, the input polynomials must be evaluated at at least 2N-1 points, more than is needed to represent each input polynomial.
The product of two polynomials is then computed in three steps:
1. evaluate: get the value of both inputs for all points x_{i}
2. compute the output value at all points x_{i}: R(x_{i})=P(x_{i})*Q(x_{i})
3. interpolate: get the coefficients of the result (get the interpolating polynomial)
Provided steps 1 and 3 can be computed in less than O(N^{2}) operations, we get a method faster than the trivial polynomial product.
Another note: I didn't specify a domain for the coefficients of the polynomials. Each algorithm has a "good" choice of domain: integers, integers modulo n, complex numbers, polynomials, ...
Evaluation and interpolation schemes
Evaluating a polynomial P=Σp_{i}*X^{i} of degree N-1 at 2N-1 points x_{i} is linear in the polynomial coefficients p_{i}. The vector y, y_{i}=P(x_{i}), is the product of a (2N-1)×N matrix X and the coefficient vector p: y=X*p. Row i of X is (x_{i}^{0},x_{i}^{1},...,x_{i}^{N-1}). Some methods allow x_{i}=∞, with the convention P(∞)=p_{N-1}, i.e. the leading coefficient of P.
The objective is to evaluate this matrix-vector product at the lowest cost. We will see a few examples below, taken from the Wikipedia pages of the Toom-Cook and Schönhage-Strassen algorithms (links above), and the article What About Toom-Cook Matrices Optimality? by Marco Bodrato and Alberto Zanoni.
Example: 0,1,∞
This corresponds to the Karatsuba algorithm. The matrix X is:
p_{0}  p_{1}
  1      0      x_{0}=0
  1      1      x_{1}=1
  0      1      x_{2}=∞
Here we have no mystery: y_{0}=p_{0}, y_{1}=p_{0}+p_{1}, and y_{2}=p_{1}. With x_{1}=-1 instead of +1 we obtain the Knuth scheme, described in The Art of Computer Programming: Seminumerical Algorithms, section 4.3.3.
The coefficients are obtained as the image of the vector P(x_{i})*Q(x_{i}) by the inverse matrix (this time square, with all 2N-1 values):

1  0  0                      1  0  0
1  1  1    == inverse ==>   -1  1 -1
0  0  1                      0  0  1
Substituting everything, we get the Karatsuba algorithm, computing the product of P and Q with 3*mul+4*add:
r_{0} = p_{0} * q_{0}
r_{2} = p_{1} * q_{1}
r_{1} = (p_{0}+p_{1}) * (q_{0}+q_{1}) - r_{0} - r_{2}
Example: 0, +1, -1, -2, ∞
This is known as the Toom-3 algorithm, used to multiply two polynomials with N=3 coefficients (degree 2). The matrix X is the following. For each row, we give the natural evaluation cost.
p_{0}  p_{1}  p_{2}
  1      0      0      x_{0}=0,  no eval
  1      1      1      x_{1}=+1, 2*add
  1     -1      1      x_{2}=-1, 2*add
  1     -2      4      x_{3}=-2, 2*add+2*shift
  0      0      1      x_{4}=∞,  no eval
The following evaluation sequence is the fastest (from the paper of Bodrato and Zanoni), with 5*add+1*shift instead of 6*add+2*shift:
s_{1}   1   0   0    p_{0}, no eval
s_{2}   1   0   1    p_{0}+p_{2}, 1*add
s_{3}   1   1   1    s_{2}+p_{1}, 1*add
s_{4}   1  -1   1    s_{2}-p_{1}, 1*add
s_{5}   1  -2   4    (s_{4}+p_{2})*2-p_{0}, 2*add+1*shift
s_{6}   0   0   1    p_{2}, no eval
The interpolation matrix is:
   1     0     0     0     0
  1/2   1/3   -1    1/6   -2
  -1    1/2   1/2    0    -1
 -1/2   1/6   1/2  -1/6    2
   0     0     0     0     1
The fastest interpolation sequence (Bodrato and Zanoni) can be found on the Wikipedia page, and requires 8*add+3*shift+1*div.
Example: ω^{0},ω^{1},...,ω^{n-1}
This is known as the Discrete Fourier Transform (DFT). ω is a complex n-th root of 1, with n generally a power of two. For the sake of simplicity, let's consider in this example the specific case n=16, and ω=e^{2*I*π/16}. We have ω^{4}=I, ω^{8}=-1, ω^{12}=-I. It is used to multiply polynomials with N=8 coefficients. The X matrix is the following. To simplify the notation, each entry in the matrix is the exponent of ω: for example, 3 means ω^{3}. An entry '-' means a 0 in the matrix.
k:  y_{k} as a sum of ω^{X_{ki}}*p_{i}
 0:  0  0  0  0  0  0  0  0
 1:  0  1  2  3  4  5  6  7
 2:  0  2  4  6  8 10 12 14
 3:  0  3  6  9 12 15  2  5
 4:  0  4  8 12  0  4  8 12
 5:  0  5 10 15  4  9 14  3
 6:  0  6 12  2  8 14  4 10
 7:  0  7 14  5 12  3 10  1
 8:  0  8  0  8  0  8  0  8
 9:  0  9  2 11  4 13  6 15
10:  0 10  4 14  8  2 12  6
11:  0 11  6  1 12  7  2 13
12:  0 12  8  4  0 12  8  4
13:  0 13 10  7  4  1 14 11
14:  0 14 12 10  8  6  4  2
15:  0 15 14 13 12 11 10  9
The classical Cooley-Tukey algorithm computes these 16 values recursively. It is one of the Fast Fourier Transform (FFT) algorithms used to compute the DFT. Let's suppose we have managed to compute the following values s_{k} and t_{k}. This is done using exactly the same algorithm on the N/2 even-numbered (s) and N/2 odd-numbered (t) coefficients, with ω^{2} in place of ω:
k:  s_{k}
0:  0  -  0  -  0  -  0  -
1:  0  -  2  -  4  -  6  -
2:  0  -  4  -  8  - 12  -
3:  0  -  6  - 12  -  2  -
4:  0  -  8  -  0  -  8  -
5:  0  - 10  -  4  - 14  -
6:  0  - 12  -  8  -  4  -
7:  0  - 14  - 12  - 10  -

k:  t_{k}
0:  -  0  -  0  -  0  -  0
1:  -  0  -  2  -  4  -  6
2:  -  0  -  4  -  8  - 12
3:  -  0  -  6  - 12  -  2
4:  -  0  -  8  -  0  -  8
5:  -  0  - 10  -  4  - 14
6:  -  0  - 12  -  8  -  4
7:  -  0  - 14  - 12  - 10
Then we can assemble these two sequences to obtain y:
y_{k}   = s_{k} + ω^{k} t_{k}
y_{k+8} = s_{k} - ω^{k} t_{k}        (k = 0..7)
The next page shows in more detail the Schönhage-Strassen algorithm, which uses the same DFT algorithm, but on integers modulo 2^{m}+1.