GPU Benchmarks
Eric Bainville - Nov 2009
Introduction
After the assembly experiments on the CPU, these pages show how to program a GPU using OpenCL to perform multiprecision arithmetic. This first part focuses on measuring the potential speed of multithreaded CPUs and GPUs for multiprecision computations.
A GPU provides a highly parallel architecture, initially dedicated to the fixed 3D rendering pipeline. In recent years, parts of the pipeline have become more and more programmable, and rendering now runs on a set of "general purpose" processing cores. On modern GPUs, these cores can be used directly for general purpose computations.
This article intends to provide a fair comparison of recent CPUs and GPUs in realistic conditions (i.e. actually computing something). For each target, we will try to provide the fastest implementation of each algorithm, using multithreading and vector instructions when needed.
Update (May 2010). I have updated this article with tests using new versions of the drivers.
Contents
Memory operations - Basic memory operations: copy and set to 0.
Addition - How multiprecision addition can be implemented on highly parallel architectures.
Available Flops - Raw processing power of the CPU and the GPU.
Product by one digit - Multiply a multiprecision number and a single digit.
OpenCL, hardware, drivers, and software
GPU programming
Several APIs allow the execution of code on the GPU:
- Stream (low level, ATI only),
- CUDA (low level, NVIDIA only),
- OpenGL using GLSL shaders,
- DirectX using Compute (Windows only),
- OpenCL (GPU and multi-core CPU).
OpenCL is the only dedicated API running on all systems and hardware, and we will use it in these pages.
Original post (Nov 2009). Today, OpenCL on the GPU is still mainly in Beta. NVidia has released a public Beta of their main driver series featuring OpenCL (Windows release 195.39, Linux release 195.17). In mid-October, AMD released a Beta version (Stream 2.0 beta 4) running on both GPU and CPU.
Update (May 2010). OpenCL support by NVidia is now mature. Runtime support is released as part of the public driver series, and SDK support is integrated in the CUDA toolkit. On the AMD side, despite potentially superior hardware (at least until Fermi was released), OpenCL software support was below expectations, even though it has improved dramatically since Nov 2009. The 2.1 Stream SDK and the Catalyst 10.4 drivers released in May 2010 now provide many more features (image support, etc.) and better performance.
OpenCL is not only for the GPU: OpenCL on the CPU gives access to all the power of modern multicore processors without having to manage threads and SSE instructions explicitly. The AMD OpenCL drivers are the only ones to provide CPU support, and the performance is comparable to threads+SSE code that would take much longer to write (not to mention maintaining and porting it).
In the curves, I removed the old HD5870 Linux measurements, since performance has probably improved a lot with the new drivers (as it did on Windows). I still have to update them.
Test systems
I run the tests on two machines:
Machine A:
- CPU Intel Core i7 920 (4 cores, 8 threads) @3.33 GHz (overclocked)
- Chipset Intel X58
- 6GB of DDR3 @1.33 GHz
- GPU ATI Radeon HD5870 1GB
Machine B:
- CPU Intel Core 2 Quad Q9550 (4 cores) @2.83 GHz (stock speed)
- Chipset Intel P45
- 12GB of DDR2 @800 MHz
- GPU NVidia GTX285 1GB
On each machine I run the tests on two systems:
Linux 64-bit:
- kernel-2.6.32 glibc-2.10.1 gcc-4.3.4
- NVidia driver 195.36.24 + CUDA toolkit 3.0
- ATI driver 2.0-beta4 + Stream SDK 2.0 beta 4 (not updated yet)
Windows 7 64-bit:
- vs2008-sp1
- NVidia dev driver 197.13 + CUDA toolkit 3.0
- ATI Catalyst 10.4 + Stream SDK 2.1
To avoid ambiguity, we adopt the (standard) conventions:
1 KiB = 2^10 B, 1 MiB = 2^20 B, 1 GiB = 2^30 B
1 KB = 10^3 B, 1 MB = 10^6 B, 1 GB = 10^9 B
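The two families of units differ by about 7% at the giga scale, which matters when comparing memory throughput figures. As a minimal sketch of the conventions above (the constant and function names are illustrative and are not taken from the benchmark sources), the conversions could be written as:

```cpp
#include <cassert>
#include <cstdint>

// Decimal (SI) units: powers of 10.
constexpr std::uint64_t KB = 1000ULL;
constexpr std::uint64_t MB = 1000ULL * KB;
constexpr std::uint64_t GB = 1000ULL * MB;

// Binary (IEC) units: powers of 2.
constexpr std::uint64_t KiB = 1ULL << 10;
constexpr std::uint64_t MiB = 1ULL << 20;
constexpr std::uint64_t GiB = 1ULL << 30;

// Throughput in GB/s (decimal) for `bytes` moved in `seconds`.
// Hypothetical helper, shown only to illustrate the unit convention.
inline double gb_per_second(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / static_cast<double>(GB) / seconds;
}

// Same measurement expressed in GiB/s (binary).
inline double gib_per_second(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / static_cast<double>(GiB) / seconds;
}
```

For instance, copying 1 GiB in one second is exactly 1 GiB/s, but slightly more than 1.07 GB/s.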
We measure the effective wallclock time (not the device execution time reported by event profiling), because it is what matters to the user sitting in front of the machine.
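A minimal sketch of such a wallclock measurement, assuming a C++11 compiler and `std::chrono::steady_clock`, is shown here. The helper names are hypothetical, keeping the best of several runs is an assumption made for this sketch, and the actual timing code in the benchmark archive may differ:

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Run `work` once and return the elapsed wallclock time in seconds.
// Hypothetical helper, not taken from the benchmark sources.
template <typename F>
double wallclock_seconds(F&& work) {
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Run `work` `repeats` times and return the shortest time observed.
// Taking the minimum is a choice of this sketch, meant to reduce
// noise from other activity on the machine.
template <typename F>
double best_wallclock_seconds(F&& work, int repeats) {
    assert(repeats > 0);
    double best = wallclock_seconds(work);
    for (int i = 1; i < repeats; ++i) {
        const double t = wallclock_seconds(work);
        if (t < best) best = t;
    }
    return best;
}
```

When timing OpenCL work this way, the callable has to block until the device has finished (for example by calling `clFinish` on the command queue), otherwise only the time needed to enqueue the commands is measured.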
Before getting into the subject and actually operating on large integers, we will evaluate the memory bandwidth and computational power of both the GPU and the CPU. The next page is devoted to memory copy and zero operations.
Source code
The source code is available here: MPBenchmarks-20091214.zip (41 KB). It contains all benchmarks (CPU and GPU) described on these pages. You will also find a few classes providing an (incomplete) C++ encapsulation of the OpenCL objects. The design is a little different from that of the "official" C++ wrapper, and some may or may not like it.