How fast can we compute 1D gradient?

Eric Bainville - Oct 2009

C code

Let's start with a trivial implementation in C:

for (int i=0;i<n;i++)
{
  if (i == 0) { out[i] = 0; continue; }
  if (i == n-1) { out[i] = 0; continue; }
  out[i] = in[i+1] - in[i-1];
}

It runs at 4.2 cycles/float for /arch:none and /arch:SSE, and 6.2 cycles/float for /arch:SSE2 with /fp:precise. The code generated with /arch:none and /arch:SSE uses fld,fsub,fstp. The code generated with /arch:SSE2 and /fp:precise uses movss and conversions to double before and after a call to subsd. With /arch:SSE2 and /fp:fast, the code runs at 4.2 cycles/float, and uses movss,subss,movss as expected.

For the next attemps, we will use only the options /O2 /arch:SSE2 /fp:fast to compile the C code, and use SSE2 instructions instead of the x87 floating-point instruction set.

The first optimization would be to remove the tests from inside the loop. Even if the loop prediction algorithms of the CPU almost eliminate the missed branch penalties, it is better to eliminate the tests, and keep the loop code short:

out[0] = out[n-1] = 0;
for (int i=n-2;i>0;i--)
{
  out[i] = in[i+1] - in[i-1];
}

This code runs at 2.0 cycles/float. We can try to unroll the loop and compute two values per iteration:

out[0] = out[n-1] = 0;
for (int i=n-3;i>0;i-=2)
{
  out[i] = in[i+1] - in[i-1];
  out[i+1] = in[i+2] - in[i];
}

There is no speed improvement here, we remain at 2.0 cycles/float. Let's try to read each input value only once:

out[0] = out[n-1] = 0;
float x0,x1,x2;
x2 = in[n-1];
x1 = in[n-2];
for (int i=n-2;i>0;i--)
{
  x0 = in[i-1]; // x2=in[i+1] x1=in[i] x0=in[i-1]
  out[i] = x2 - x0;
  x2 = x1;
  x1 = x0;
}

This one runs at 1.6 cycles/float. Let's unroll this loop:

out[0] = out[n-1] = 0;
float x0,x1,x2,x3;
x3 = in[n-1];
x2 = in[n-2];
for (int i=n-3;i>0;i-=2)
{
  x1 = in[i];
  x0 = in[i-1];  // x3=in[i+2] x2=in[i+1] x1=in[i] x0=in[i-1]
  out[i+1] = x3 - x1;
  out[i] = x2 - x0;
  x3 = x1;
  x2 = x0;
}

This code is actually slower, and runs at 2.0 cycles/float. In the following pages, we are going to experiment with SIMD SSE/SSE2 instructions, starting with a simple problem related to the gradient: shifting a float vector. The next page presents in more details the SSE instructions used to move floats around.