# OpenCL Multiprecision

Eric Bainville - Jan 2010

## The Schönhage-Strassen Algorithm (SSA)

### Combination patterns

For more details on the SSA, refer to the Wikipedia page, and the paper A GMP-based Implementation of Schönhage-Strassen's Large Integer Multiplication Algorithm by Pierrick Gaudry, Alexander Kruppa, and Paul Zimmermann.

The SSA uses a Discrete Fourier Transform, but instead of complex numbers, the values are integers modulo 2m+1 or 2m-1 for some m. The algorithm is applied recursively, with smaller and smaller values of m.

The interesting point about SSA is that ω, the n-th root of 1, can be chosen (under certain conditions) to be a power of B. In this case the basic operation, a product by ωi, reduces to a fast shift+add combination.

This is illustrated in the following example. Here we have a polynomial P with 8 coefficients a, b, ..., h, defined by P(X)=a.X0+b.X1+c.X2+...+h.X7. Each coefficient C (a, b,...) is written on 4 "digits": C=C0+B*C1+B2*C2+B3*C3, and we want to compute the 8 values P(B0), P(B1), ..., P(B7) modulo B4+1.

The stacks represent a sum of the 8 coefficients. The first row is for a, the second for b, etc. In the figure below, the stack represents P(B1)=a.B0+b.B1+c.B2+...+h.B7:

Stack representing P(B1). The columns represent the digits of the sum, with weights B0, B1, B2, ...,B10 from right to left. The 8 rows correspond to the 8 coefficients a, b, ..., h. The rightmost digit (weight B0) is then a0; the next one (weight B1) is a1+b0; the leftmost digit (weight B10) is h3.

Now, since B4=-1, we can wrap the coefficients, changing sign each time a B4 factor is removed. The P(B1) stack is then equivalently rewritten:

Equivalent stack wrapped modulo B4+1. The red cells are subtracted, and the blue cells are added. We now have only 4 columns, corresponding to weights B0,...,B3 from right to left. The rightmost digit is then a0-b3-c2-d1-e0+f3+g2+h1.

The 8 values P(Bi) can be represented by similar stacks, as shown here:

Stacks corresponding to the 8 values P(B0),...,P(B7).

As seen in the previous page, the usual FFT algorithm recursively computes the values for even and odd numbered sub-sequences.

Stacks computed recursively. sk and tk are then combined to obtain P(Bk) and P(Bk+4).

### Example

As a more detailed complement to the Wikipedia page, let's see how we can compute the square of a 12-digit decimal number using SSA.

```X = 383 886 777 915
```

We will represent X by a polynomial with N=8 coefficients modulo 108+1. Coefficients are stored in base B=100 on four digits. The number X is then stored on 1/4 of the 32 input digits. This is required to handle correctly the degree of the output polynomial, and the magnitude of its coefficients. The coefficients of the input polynomial P are then:

```P0 = +000 +000 +009 +015
P1 = +000 +000 +007 +077
P2 = +000 +000 +008 +086
P3 = +000 +000 +003 +083
P4 = +000 +000 +000 +000
P5 = +000 +000 +000 +000
P6 = +000 +000 +000 +000
P7 = +000 +000 +000 +000
```

The stacks used to compute the 8 values are given below. For each stack, we display the shifted and wrapped coefficients, raw sum, and the corresponding normalized value. All values modulo 108+1 can be normalized with coefficients in the range 0..99, except for 108, represented by +000 +000 +000 -001.

 ``` +000 +000 +009 +015 +000 +000 +007 +077 +000 +000 +008 +086 +000 +000 +003 +083 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 raw sum = +000 +000 +027 +261 P(B0) = +000 +000 +029 +061 +000 +000 +009 +015 +000 +007 +077 +000 +008 +086 +000 +000 +083 +000 +000 -003 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 raw sum = +091 +093 +086 +012 P(B1) = +091 +093 +086 +012 +000 +000 +009 +015 +007 +077 +000 +000 +000 +000 -008 -086 -003 -083 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 raw sum = +004 -006 +001 -071 P(B2) = +003 +094 +000 +029 +000 +000 +009 +015 +077 +000 +000 -007 -008 -086 +000 +000 +000 +003 +083 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 raw sum = +069 -083 +092 +008 P(B3) = +068 +017 +092 +008 ``` ``` +000 +000 +009 +015 +000 +000 -007 -077 +000 +000 +008 +086 +000 +000 -003 -083 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 raw sum = +000 +000 +007 -059 P(B4) = +000 +000 +006 +041 +000 +000 +009 +015 +000 -007 -077 +000 +008 +086 +000 +000 -083 +000 +000 +003 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 raw sum = -075 +079 -068 +018 P(B5) = +025 +078 +032 +019 +000 +000 +009 +015 -007 -077 +000 +000 +000 +000 -008 -086 +003 +083 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 raw sum = -004 +006 +001 -071 P(B6) = +096 +006 +000 +030 +000 +000 +009 +015 -077 +000 +000 +007 -008 -086 +000 +000 +000 -003 -083 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 +000 raw sum = -085 -089 -074 +022 P(B7) = +014 +010 +026 +023 ```

The transformed vector Q, containing all values P(Bk) is then:

```Q0 +000 +000 +029 +061
Q1 +091 +093 +086 +012
Q2 +003 +094 +000 +029
Q3 +068 +017 +092 +008
Q4 +000 +000 +006 +041
Q5 +025 +078 +032 +019
Q6 +096 +006 +000 +030
Q7 +014 +010 +026 +023
```

We now have to compute the squares of each of these values. In the real world, these would be numbers much smaller than the initial input, and the products would be handled by the most appropriate algorithm for this size: SSA for the largest ones, Karatsuba or Toom-Cook for the medium range, and trivial multiplication for the smallest ones.

```Q20 +008 +076 +075 +021
Q21 +091 +095 +094 +062
Q22 +028 +036 +056 +003
Q23 +057 +002 +032 +021
Q24 +000 +041 +008 +081
Q25 +075 +035 +042 +018
Q26 +071 +032 +056 +008
Q27 +073 +049 +012 +090
```

To get the coefficients of the result, we now have to apply the inverse Fourier Transform. Up to a permutation and a constant factor, it can be done by applying exactly the same transformation to Q2. Doing this, we obtain the coefficients R. The number in parentheses gives the index of the coefficient in the non-permuted result:

```R0 +006 +069 +078 +000  (0)
R1 +000 +000 +000 +000  (7)
R2 +001 +017 +035 +012  (6)
R3 +005 +042 +094 +008  (5)
R4 +011 +004 +014 +024  (4)
R5 +016 +062 +018 +072  (3)
R6 +017 +080 +008 +072  (2)
R7 +011 +037 +052 +080  (1)
```

The last step is to assemble the result, using 3 digits per coefficient. Each coefficient must first be divided by N=8:

```+006 +069 +078 +000 =>                         837 225
+011 +037 +052 +080 =>                   1 421 910
+017 +080 +008 +072 =>               2 225 109
+016 +062 +018 +072 =>           2 077 734
+011 +004 +014 +024 =>       1 380 178
+005 +042 +094 +008 =>     678 676
+001 +017 +035 +012 => 146 689
+006 +069 +078 +000 =>   0
X^2 =                  147 369 058 257 960 531 747 225
```

### Recursive evaluation

The basic operation of the recursive algorithm takes two rows x and y and combines them into x+ωk.i.y and x-ωk.i.y. Except for the initial step, requiring a permutation, it can be done "in-place" in memory, replacing the two input rows with the outputs. This basic operation is called a "butterfly". There are log2(N) levels of recursion, and each level requires N/2 butterflies.

The recursive algorithm described here with N=8 coefficients. The first step (in green) is a permutation (binary inverse). The next log2(N)=3 steps (in red) each involve 4 butterflies, with k=4, k=2, and k=1 respectively.

In our example, we would compute the following:

```Input values   reverse binary permutation
+000 +000 +009 +015   ---------   +000 +000 +009 +015
+000 +000 +007 +077   -. .-----   +000 +000 +000 +000
+000 +000 +008 +086   --|------   +000 +000 +008 +086
+000 +000 +003 +083   --|--. .-   +000 +000 +000 +000
+000 +000 +000 +000   -' '--|--   +000 +000 +007 +077
+000 +000 +000 +000   ------|--   +000 +000 +000 +000
+000 +000 +000 +000   -----' '-   +000 +000 +003 +083
+000 +000 +000 +000   ---------   +000 +000 +000 +000

Combine 1
+000 +000 +009 +015   -..-  +000 +000 +009 +015
+000 +000 +000 +000   -''-  +000 +000 +009 +015
+000 +000 +008 +086   -..-  +000 +000 +008 +086
+000 +000 +000 +000   -''-  +000 +000 +008 +086
+000 +000 +007 +077   -..-  +000 +000 +007 +077
+000 +000 +000 +000   -''-  +000 +000 +007 +077
+000 +000 +003 +083   -..-  +000 +000 +003 +083
+000 +000 +000 +000   -''-  +000 +000 +003 +083

Combine 2                     Combine 3
+000 +000 +009 +015  -. .-      +000 +000 +017 +101 -.   .- +000 +000 +027 +261
+000 +000 +009 +015    X  -. .- +008 +086 +009 +015   \ /   +091 +093 +086 +012
+000 +000 +008 +086  -' '-  X   +000 +000 +001 -071    X    +004 -006 +001 -071
+000 +000 +008 +086       -' '- -008 -086 +009 -015   / \   +069 -083 +092 +008
+000 +000 +007 +077  -. .-      +000 +000 +010 +160 -'   '- +000 +000 +007 -059
+000 +000 +007 +077    X  -. .- +003 +083 +007 +077    .    -075 +079 -068 +018
+000 +000 +003 +083  -' '-  X   +000 +000 +004 -006    .    -004 +006 +001 -071
+000 +000 +003 +083       -' '- -003 -083 +007 +077    .    -085 -089 -074 +022
```

In the next page, we will experiment implementations of the SSA in OpenCL.