OpenCL Multiprecision

Eric Bainville - Jan 2010

The Schönhage-Strassen Algorithm (SSA)

Combination patterns

For more details on the SSA, refer to the Wikipedia page, and the paper A GMP-based Implementation of Schönhage-Strassen's Large Integer Multiplication Algorithm by Pierrick Gaudry, Alexander Kruppa, and Paul Zimmermann.

The SSA uses a Discrete Fourier Transform, but instead of complex numbers, the values are integers modulo 2^m+1 or 2^m-1 for some m. The algorithm is applied recursively, with smaller and smaller values of m.

The interesting point about SSA is that ω, the n-th root of 1, can be chosen (under certain conditions) to be a power of B. In this case the basic operation, a product by ωⁱ, reduces to a fast shift+add combination.

This is illustrated in the following example. Here we have a polynomial P with 8 coefficients a, b, ..., h, defined by P(X)=a.X⁰+b.X¹+c.X²+...+h.X⁷. Each coefficient C (a, b,...) is written on 4 "digits": C=C₀+B*C₁+B²*C₂+B³*C₃, and we want to compute the 8 values P(B⁰), P(B¹), ..., P(B⁷) modulo B⁴+1.

The stacks represent a sum of the 8 coefficients. The first row is for a, the second for b, etc. In the figure below, the stack represents P(B¹)=a.B⁰+b.B¹+c.B²+...+h.B⁷:

Stack representing P(B¹). The columns represent the digits of the sum, with weights B⁰, B¹, B², ...,B¹⁰ from right to left. The 8 rows correspond to the 8 coefficients a, b, ..., h. The rightmost digit (weight B⁰) is then a₀; the next one (weight B¹) is a₁+b₀; the leftmost digit (weight B¹⁰) is h₃.

Now, since B⁴=-1, we can wrap the coefficients, changing sign each time a B⁴ factor is removed. The P(B¹) stack is then equivalently rewritten:

Equivalent stack wrapped modulo B⁴+1. The red cells are subtracted, and the blue cells are added. We now have only 4 columns, corresponding to weights B⁰,...,B³ from right to left. The rightmost digit is then a₀-b₃-c₂-d₁-e₀+f₃+g₂+h₁.

The 8 values P(Bⁱ) can be represented by similar stacks, as shown here:

Stacks corresponding to the 8 values P(B⁰),...,P(B⁷).

As seen in the previous page, the usual FFT algorithm recursively computes the values for even and odd numbered sub-sequences.

Stacks computed recursively. s_k and t_k are then combined to obtain P(B^k) and P(B^k+4).

Example

As a more detailed complement to the Wikipedia page, let's see how we can compute the square of a 12-digit decimal number using SSA.

X = 383 886 777 915

We will represent X by a polynomial with N=8 coefficients modulo 10⁸+1. Coefficients are stored in base B=100 on four digits. The number X is then stored on 1/4 of the 32 input digits. This is required to handle correctly the degree of the output polynomial, and the magnitude of its coefficients. The coefficients of the input polynomial P are then:

P₀ = +000 +000 +009 +015
P₁ = +000 +000 +007 +077
P₂ = +000 +000 +008 +086
P₃ = +000 +000 +003 +083
P₄ = +000 +000 +000 +000
P₅ = +000 +000 +000 +000
P₆ = +000 +000 +000 +000
P₇ = +000 +000 +000 +000

The stacks used to compute the 8 values are given below. For each stack, we display the shifted and wrapped coefficients, raw sum, and the corresponding normalized value. All values modulo 10⁸+1 can be normalized with coefficients in the range 0..99, except for 10⁸, represented by +000 +000 +000 -001.

          +000 +000 +009 +015
          +000 +000 +007 +077
          +000 +000 +008 +086
          +000 +000 +003 +083
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
raw sum = +000 +000 +027 +261
P(B⁰) = +000 +000 +029 +061

          +000 +000 +009 +015
          +000 +007 +077 +000
          +008 +086 +000 +000
          +083 +000 +000 -003
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
raw sum = +091 +093 +086 +012
P(B¹) = +091 +093 +086 +012

          +000 +000 +009 +015
          +007 +077 +000 +000
          +000 +000 -008 -086
          -003 -083 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
raw sum = +004 -006 +001 -071
P(B²) = +003 +094 +000 +029

          +000 +000 +009 +015
          +077 +000 +000 -007
          -008 -086 +000 +000
          +000 +003 +083 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
raw sum = +069 -083 +092 +008
P(B³) = +068 +017 +092 +008

          +000 +000 +009 +015
          +000 +000 -007 -077
          +000 +000 +008 +086
          +000 +000 -003 -083
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
raw sum = +000 +000 +007 -059
P(B⁴) = +000 +000 +006 +041

          +000 +000 +009 +015
          +000 -007 -077 +000
          +008 +086 +000 +000
          -083 +000 +000 +003
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
raw sum = -075 +079 -068 +018
P(B⁵) = +025 +078 +032 +019

          +000 +000 +009 +015
          -007 -077 +000 +000
          +000 +000 -008 -086
          +003 +083 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
raw sum = -004 +006 +001 -071
P(B⁶) = +096 +006 +000 +030

          +000 +000 +009 +015
          -077 +000 +000 +007
          -008 -086 +000 +000
          +000 -003 -083 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
          +000 +000 +000 +000
raw sum = -085 -089 -074 +022
P(B⁷) = +014 +010 +026 +023

The transformed vector Q, containing all values P(B^k) is then:

Q₀ +000 +000 +029 +061
Q₁ +091 +093 +086 +012
Q₂ +003 +094 +000 +029
Q₃ +068 +017 +092 +008
Q₄ +000 +000 +006 +041
Q₅ +025 +078 +032 +019
Q₆ +096 +006 +000 +030
Q₇ +014 +010 +026 +023

We now have to compute the squares of each of these values. In the real world, these would be numbers much smaller than the initial input, and the products would be handled by the most appropriate algorithm for this size: SSA for the largest ones, Karatsuba or Toom-Cook for the medium range, and trivial multiplication for the smallest ones.

Q2₀ +008 +076 +075 +021
Q2₁ +091 +095 +094 +062
Q2₂ +028 +036 +056 +003
Q2₃ +057 +002 +032 +021
Q2₄ +000 +041 +008 +081
Q2₅ +075 +035 +042 +018
Q2₆ +071 +032 +056 +008
Q2₇ +073 +049 +012 +090

To get the coefficients of the result, we now have to apply the inverse Fourier Transform. Up to a permutation and a constant factor, it can be done by applying exactly the same transformation to Q2. Doing this, we obtain the coefficients R. The number in parentheses gives the index of the coefficient in the non-permuted result:

R₀ +006 +069 +078 +000  (0)
R₁ +000 +000 +000 +000  (7)
R₂ +001 +017 +035 +012  (6)
R₃ +005 +042 +094 +008  (5)
R₄ +011 +004 +014 +024  (4)
R₅ +016 +062 +018 +072  (3)
R₆ +017 +080 +008 +072  (2)
R₇ +011 +037 +052 +080  (1)

The last step is to assemble the result, using 3 digits per coefficient. Each coefficient must first be divided by N=8:

+006 +069 +078 +000 =>                         837 225
+011 +037 +052 +080 =>                   1 421 910
+017 +080 +008 +072 =>               2 225 109
+016 +062 +018 +072 =>           2 077 734
+011 +004 +014 +024 =>       1 380 178
+005 +042 +094 +008 =>     678 676
+001 +017 +035 +012 => 146 689
+006 +069 +078 +000 =>   0
X^2 =                  147 369 058 257 960 531 747 225

Recursive evaluation

The basic operation of the recursive algorithm takes two rows x and y and combines them into x+ω^k.i.y and x-ω^k.i.y. Except for the initial step, requiring a permutation, it can be done "in-place" in memory, replacing the two input rows with the outputs. This basic operation is called a "butterfly". There are log₂(N) levels of recursion, and each level requires N/2 butterflies.

The recursive algorithm described here with N=8 coefficients. The first step (in green) is a permutation (binary inverse). The next log₂(N)=3 steps (in red) each involve 4 butterflies, with k=4, k=2, and k=1 respectively.

In our example, we would compute the following:

Input values   reverse binary permutation
+000 +000 +009 +015   ---------   +000 +000 +009 +015
+000 +000 +007 +077   -. .-----   +000 +000 +000 +000
+000 +000 +008 +086   --|------   +000 +000 +008 +086
+000 +000 +003 +083   --|--. .-   +000 +000 +000 +000
+000 +000 +000 +000   -' '--|--   +000 +000 +007 +077
+000 +000 +000 +000   ------|--   +000 +000 +000 +000
+000 +000 +000 +000   -----' '-   +000 +000 +003 +083
+000 +000 +000 +000   ---------   +000 +000 +000 +000

                   Combine 1
+000 +000 +009 +015   -..-  +000 +000 +009 +015
+000 +000 +000 +000   -''-  +000 +000 +009 +015
+000 +000 +008 +086   -..-  +000 +000 +008 +086
+000 +000 +000 +000   -''-  +000 +000 +008 +086
+000 +000 +007 +077   -..-  +000 +000 +007 +077
+000 +000 +000 +000   -''-  +000 +000 +007 +077
+000 +000 +003 +083   -..-  +000 +000 +003 +083
+000 +000 +000 +000   -''-  +000 +000 +003 +083


                     Combine 2                     Combine 3
+000 +000 +009 +015  -. .-      +000 +000 +017 +101 -.   .- +000 +000 +027 +261
+000 +000 +009 +015    X  -. .- +008 +086 +009 +015   \ /   +091 +093 +086 +012
+000 +000 +008 +086  -' '-  X   +000 +000 +001 -071    X    +004 -006 +001 -071
+000 +000 +008 +086       -' '- -008 -086 +009 -015   / \   +069 -083 +092 +008
+000 +000 +007 +077  -. .-      +000 +000 +010 +160 -'   '- +000 +000 +007 -059
+000 +000 +007 +077    X  -. .- +003 +083 +007 +077    .    -075 +079 -068 +018
+000 +000 +003 +083  -' '-  X   +000 +000 +004 -006    .    -004 +006 +001 -071
+000 +000 +003 +083       -' '- -003 -083 +007 +077    .    -085 -089 -074 +022

In the next page, we will experiment implementations of the SSA in OpenCL.