CPU/GPU Multiprecision Mandelbrot Set

Eric Bainville - Dec 2009

Fixed-point reals

Fixed-point (FP) reals are represented by a structure like the following:

struct FPReal
{
  int sign;     // -1,0,+1
  uint32 m[N];  // N 32-bit unsigned words
};

The corresponding value is 0 if sign is 0, and otherwise (-1)^signΣm[k].2^-32k for k=0..N-1. In other terms, the integral part of the real is m[0], and the 32 bits of m[N-1] are the least significant bits of the fractional part.

For Mandelbrot Set computations, the limitation to 32 bits for the integral part is not an issue. The IEEE double type stores values on a 53-bit mantissa, and for N=4 we nearly double the precision with 96 bits for the fractional part.

The operations are implemented in two layers. The lower layer, similar to GMP mpn layer, provides basic operations on arrays of uint32. In the sources, the corresponding file is MPBase.h for the C code, and mp_base.cl for the OpenCL code. In the following, X, Y, Z are N-words arrays, and k is an integer.

zeroN(N,Z): set all words of Z to 0.
copyN(N,Z,X): copy X to Z.
addN(N,Z,X,in): compute Z+=X with in/out carry.
subN(N,Z,X,in): compute Z-=X with in/out borrow.
mulK(N,Z,k,in): compute Z*=k with in/out carry, k is a single word.
divK(N,Z,k,in): compute Z/=k with in/out remainder, k is a single word.
shlK(N,Z,k,in): compute Z<<=k with in/out bits, k is 0..31.
shrK(N,Z,k,in): compute Z>>=k with in/out bits, k is 0..31.
cmpN(N,X,Y): compare X and Y.
isZeroN(N,Z): test if all bits of Z are 0.
msbN(N,Z): get index of the most significant bit of Z.

These functions are implemented in C and OpenCL on 32-bit words, extended to 64 bits to handle carry propagation and integer multiplication. The C implementation is almost identical to the OpenCL implementation, and does not use assembler code or intrinsics.

The FPReal type is implemented as a C++ class for the native code, and structure+functions for the OpenCL code. The operators are trivially implemented by calling the lower level functions.

The product is implemented using two loops, the simplest but least efficient algorithm. Products of same weight are accumulated in an array of 64-bit words, and carry is propagated at the end of the computation. The product is truncated to the most significant N words:

// Multiply the two inputs x->m[N] and y->m[N]
// and accumulate the products in aux[N]
uint64 aux[FPWords];
for (int i=0;i<N;i++) aux[i] = 0;
for (int i=0;i<N;i++) for (int j=0;j<N;j++)
{
  int k = i+j;
  if (k > N) continue; // ignored
  uint64 u1 = (uint64)(x->m[i]) * (uint64)(y->m[j]);
  uint64 u0 = u1 & WordMask; // lower 32 bits, index K
  u1 >>= (uint64)32; // higher 32 bits, index K-1
  if (k < N) aux[k] += u0;
  if (k > 0) aux[k-1] += u1;
}

// Propagate carry into result z->m[N]
uint64 c = 0;
for (int i=N;i>=0;i--)
{
  c += aux[i];
  z->m[i] = (uint32)(c & WordMask);
  c >>= (uint64)32;
}

To avoid kernels running for too long, and then issues with the system watchdog timer interrupting the computation, it is wise to subdivide the image into blocks, and run one kernel for each block. Here we used 1x1 block for float or double kernels, and 16x16 blocks for software fixed-point kernels. See the code in MandelbrotOpenCL.cpp.

Running this code as an OpenCL kernel turned out te be quite disappointing (DNF = Did Not Finish):

Software fp128 (first version, Nov 2009 drivers, not updated)
Device, Code	Mini Set 1		Julia Island
	Linux	Windows	Linux	Windows
Q9550, C x8	4.53s	3.76s	372s	366s
Core i7, C x8	3.50s	3.00s	302s	258s
Core i7, OpenCL	4.65s	6.80s	351s	469s
GTX285, OpenCL	48.6s	48.7s	DNF	DNF
HD5870, OpenCL	DNF	DNF	-	-

On software fixed-point numbers with the ported C code, the CPU has a clear advantage. In both cases, the Visual C++ multithreaded code is the fastest. Note the honorable performance of the AMD OpenCL CPU implementation.

Running these kernels on the GPU is much more problematic. Julia Island crashes or freezes the NVidia drivers, and both benchmarks won't even run correctly on the AMD drivers. When the computation terminates correctly, the GPU is 10x slower than the CPU!

Simply porting moderately complex C code to OpenCL code is not enough, at least in the current state of the drivers, and probably in the general case. Rewriting the code is required to beat CPU performance.

In the next section, we will rewrite the fp128 fixed-point data type using OpenCL vector types: fp128 for OpenCL.