Cache Unfriendly Reduction

Given

var a : [1..n, 1..m] real(w); // 5<n<5000, 5<m<5000
var r : [1..n] real(w);
var c : [1..m] real(w);

and assuming both a and r are defined, the operation

forall j in 1..m do
{
    c[j] = max reduce abs(a[1..n, j] * r[1..n]);
}

is very cache unfriendly: for each fixed j, the slice a[1..n, j] walks down a column of a, touching elements that sit a full row apart in memory (Chapel stores rectangular arrays in row-major order), so nearly every access misses the cache.

The following is cache friendly but serial

c = 0.0;
for i in 1..n do
{
   const ri = r[i];

   for j in 1..m do
   {
      c[j] = max(c[j], abs(a[i, j] * ri));
   }
}

Yes, you can parallelize the inner loop, but that carries a lot of parallel overhead, as sketched below.
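
For concreteness, parallelizing the inner loop would look something like this (using the same declarations as above); the forall is created and torn down once per row of a, which is where that overhead comes from:

c = 0.0;
for i in 1..n do
{
   const ri = r[i];

   // this forall is launched n times, once per row, so the task
   // management cost is paid on every outer iteration
   forall j in 1..m do
   {
      c[j] = max(c[j], abs(a[i, j] * ri));
   }
}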

Is there a partial reduction, or a with clause, that will buy me a parallel operation which does not cause lots of cache misses and does not have huge overhead?

Something like the following, where R is a generic real(w):

inline proc fudgeA(const ref r : [] ?R, const ref a : [?D] R)
{
    const (_, columns) = D.dims();
    var c : [columns] R = 0:R;

    forall (i,j) in D with (max reduce c) do
    {
        c[j] = max(c[j], abs(r[i] * a[i, j]));
    }
    return c;
}
//OR
inline proc fudgeB(const ref r : [] ?R, const ref a : [?D] R)
{
    const (_, columns) = D.dims();
    var c : [columns] R = 0:R;

    forall (i,j) in D with (max reduce c) do
    {   
        const ra = abs(r[i] * a[i, j]);
        const cj = c[j];

        c[j] = if ra > cj then ra else cj;  
    }
    return c;
}

I am really stabbing in the dark here, sorry. (Edits were needed.)

Or maybe:

inline proc fudgeX(const ref r : [] ?R, const ref a : [?D] R)
{
    const (rows, columns) = D.dims();
    var c : [columns] R = 0:R;

    forall (i, j) in D with (max reduce c) do
    {
        c[j] max= abs(a[i, j] * r[i]);
    }
    return c;
}

That is, if max= were a valid operation.

Hi Damian,

Partial reductions aren't something we currently have language features for, though they're certainly something that would be nice to add. Digging through GitHub, we do have a couple of issues about partial reductions:

So, partial reductions are on our radar, but they're not something we've heavily discussed or worked on yet. If this is a major annoyance/hindrance to you, or it's something that shows up often in your codes, do let us know so we can use that feedback when we set priorities.

Back to the topic: thinking about how to do a 2D column-wise partial reduction in parallel, I imagine a cache-friendly parallel version would do something akin to this (distributing blocks of rows of a among tasks, computing an intermediate result of the reduction for each, and then reducing the intermediate results):

var c_local : [0..<here.maxTaskPar, 0..<n] real;
const rowsPerTask = n / here.maxTaskPar;
coforall blk in 0..<here.maxTaskPar with (ref c_local) {
  for i in blk*rowsPerTask..#rowsPerTask {
    for j in 0..<n do
      c_local[blk, j] = max(c_local[blk, j], abs(a[i, j] * r[i]));
  }
}
for j in 0..<n do
  c[j] = max reduce c_local[.., j];

Though that's certainly not the most elegant Chapel code, and I imagine things would get hairier if we tried to generalize it or distribute it across nodes (so having some higher-level construct would be nice).

For fun, I spent a bit of time playing with your example and looking at cache behavior.

use Time;

config param n = 45000;
config param w = 64;
config const alg = 0;

var a : [0..<n, 0..<n] real(w);
var r : [0..<n] real(w);
var c : [0..<n] real(w);

var watch : stopwatch;

watch.start();

if alg == 0 {
  writeln("--- parallel, bad cache behavior ---");
  forall j in 0..<n do c[j] = max reduce abs(a[.., j] * r[..]);
} else if alg == 1 {
  writeln("--- serial, bad cache behavior ---");
  for j in 0..<n do c[j] = max reduce abs(a[.., j] * r[..]);
} else if alg == 2 {
  writeln("--- serial, good cache behavior ---");
  c = 0.0;
  for i in 0..<n
  {
     const ri = r[i];
     for j in 0..<n do c[j] = max(c[j], abs(a[i, j] * ri));
  }
} else if alg == 3 {
  writeln("--- parallel, good cache behavior ---");
  var c_local : [0..<here.maxTaskPar, 0..<n] real(w);
  const rowsPerTask = n / here.maxTaskPar;
  coforall blk in 0..<here.maxTaskPar with (ref c_local) {
    for i in blk*rowsPerTask..#rowsPerTask {
      const ri = r[i];
      for j in 0..<n do c_local[blk, j] = max(c_local[blk, j], abs(a[i, j] * ri));
    }
  }
  for j in 0..<n do
    c[j] = max reduce c_local[.., j];
}

watch.stop();
writeln("Elapsed time: ", watch.elapsed());

And I see these results on my 6-core Intel desktop if I run the program through perf to see cache-misses:

--- parallel, bad cache behavior ---
Elapsed time: 15.6183
     5,952,579,266      cache-misses:u

--- serial, bad cache behavior ---
Elapsed time: 4.45748
     1,525,807,220      cache-misses:u

--- serial, good cache behavior ---
Elapsed time: 0.987536
       244,644,652      cache-misses:u

--- parallel, good cache behavior ---
Elapsed time: 0.513626
       253,860,618      cache-misses:u

(where the "serial, bad cache behavior" version takes your original example and uses a 'for' loop instead, and the "parallel, good cache behavior" one is my suggested inelegant solution). As expected, the "good cache behavior" versions run faster and have fewer cache misses. It's also interesting, though I don't have a good explanation for it, that the "serial, bad cache behavior" version has far fewer misses than the parallel version, and that the "parallel, good cache behavior" version doesn't see much benefit as I scale the number of cores (it gets a similar benefit if I set maxTaskPar to 2).

I was looking to partial reductions as a way to describe this operation succinctly and clearly. I was just curious about the current timeline for handling that feature.

You can actually get better performance than what you have done, but none of my code examples are as tight (or as transparent) as I would expect a partial reduction to be. The blocking (which both you and I do) is blatantly obvious. A partial reduction would hide this and expose just the mathematics, which would make the code far more readable and probably produce something more optimal.

... FIX coming soon (sorry)

Having done a lot of testing, my changes do not add much value to the discussion, so I will leave them be. One day I might document the scenario in Chapel constructs and publish it. Avoiding cache misses is important.

What works today is adequate and your example certainly shows succinctly how to address performance issues. I can wait for partial reductions to make my code clearer/cleaner.

My main takeaway from this discussion is that a reduce in a deeply nested loop scenario consumes parallel resources without providing performance gains.

Maybe we need a vreduce (or reducev) operator which implements serial reductions using SIMD/vector instructions, or at least a foreach, so that it does not try to consume parallelization resources. My 2c.
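
For illustration only, the kind of serial, task-free reduction I have in mind is sketched below; the helper name and signature are made up, not an existing Chapel feature, and a simple accumulation like this is something the back-end compiler can often vectorize.

// Purely illustrative: a serial stand-in for 'max reduce' over one
// column that never spawns tasks.
inline proc serialColumnMaxAbs(const ref a : [?D] ?T, const ref r : [] T,
                               j : int) : T
{
    const (R, _) = D.dims();
    var m : T = 0:T;

    for i in R do
        m = max(m, abs(a[i, j] * r[i]));
    return m;
}

That still has the column-stride access problem, of course; the point is only that the innermost reduction need not spawn tasks.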

FYI, here is a routine which handles matrix sizes that are not multiples of maxTaskPar:

proc maxOverColumns(a : [?D] real(?w), r : [] real(w))
{
    const (W, C) = D.dims();
    const B = 0..<here.maxTaskPar;
    const rowsPerTask = W.size / B.size;
    const wlo = W.low;
    var c : [B, C] real(w);

    forall b in B do
    {
        foreach (i, j) in {(wlo+rowsPerTask*b)..#rowsPerTask, C} do
        {
           c[b, j] = max(c[b, j], abs(a[i, j] * r[i]));
        }
    }
    foreach (i, j) in {(wlo+rowsPerTask*B.size)..W.high, C} do
    {
        c[0, j] = max(c[0, j], abs(a[i, j] * r[i]));
    }
    return [j in C] max reduce c[.., j];
}

Note that the pretty printing does not handle the foreach keyword as far as I can see.

There are some flaws in the previous code chunk which I thought somebody would have noticed and mentioned. That code uses the vectorization paradigm of processing the block of rows of data in two parts: most of the rows in the first loop (the forall b in B), where each task processes rowsPerTask rows, and the rump of rows, the leftovers, in the second (the last foreach loop). However, that last loop adds overhead.

A better idea is likely to be one which spreads the rows in the rump across some of the tasks in the forall loop. If the approach were to allocate one extra row to a task for each of the leftovers, and the scenario had a maxTaskPar of 24, a rowsPerTask of 100 and 11 leftovers, then the first 11 parallel tasks would each handle 100+1=101 rows and the remaining 13 parallel tasks would each handle 100+0=100 rows. The altered code might be:

proc maxOverColumns(a : [?D] real(?w), r : [] real(w))
{
    const (R, C) = D.dims();
    const B = 0 ..< min(R.size, here.maxTaskPar);
    const rowsPerTask = R.size / B.size;
    const tail = R.size - rowsPerTask * B.size;
    var _c : [B, C] real(w);

    forall b in B do
    {
        const (_more, _done) = if b < tail then (1, 0) else (0, tail);
        const _rows = rowsPerTask + _more;
        const _R = R.low + _rows * b + _done .. #_rows;

        foreach (i, j) in {_R, C} do
        {
            _c[b, j] = max(_c[b, j], abs(a[i, j] * r[i]));
        }
    }
    return [j in C] max reduce _c[.., j];
}

Is there a more effective way? I would certainly like to know.

Note that if the contents of each block, i.e. rowsPerTask*n real(w) values, do not fit in the cache available to each core allocated to a parallel task, the code is still going to suffer from cache misses, and the inner loop then needs to be split into chunks so that its working set fits in the cache.
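
A rough sketch of that chunking might look like the following. It is illustrative only: it assumes the row count divides evenly among the tasks (so the rump handling from above is omitted), and chunkCols is a made-up tuning knob that would have to be sized to the per-core cache.

proc maxOverColumnsChunked(a : [?D] real(?w), r : [] real(w),
                           chunkCols : int = 2048)
{
    const (R, C) = D.dims();
    const B = 0 ..< min(R.size, here.maxTaskPar);
    const rowsPerTask = R.size / B.size;
    var _c : [B, C] real(w);

    forall b in B with (ref _c) do
    {
        const _R = R.low + rowsPerTask * b .. #rowsPerTask;

        // work on one chunkCols-wide slab of _c[b, ..] at a time so it
        // stays resident in cache while the rows of a stream past it
        for jlo in C by chunkCols do
        {
            const _C = jlo .. min(jlo + chunkCols - 1, C.high);

            for i in _R do
            {
                const ri = r[i];

                foreach j in _C do
                {
                    _c[b, j] = max(_c[b, j], abs(a[i, j] * ri));
                }
            }
        }
    }
    return [j in C] max reduce _c[.., j];
}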

Given ranges, R, C and B, for rows and columns and blocks, and

            var _c : [B, C] real(w);
            var r : [R] real(w);
            var a : [R, C] real(w);

what is the tightest way to write the following:

coforall b in B with (ref _c)
{
    foreach (i, j) in {edge + rowsPerTask * b .. #rowsPerTask, C}
    {
        _c[b, j] = max(_c[b, j], abs(a[i, j] * r[i]));
    }
}

Also, is there some documentation about how to use the with clause in a coforall? I cannot find any, but then again, my search skills may be awful. Should I declare

ref _cb = _c[b, ..];

before the foreach loop, and then reference _cb[j] within that loop?
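
That is, something like the following, which is just the snippet above with the slice pulled out (edge and rowsPerTask as before):

coforall b in B with (ref _c)
{
    // one rank-change slice per task: _cb is a 1-D view of row b of _c
    ref _cb = _c[b, ..];

    foreach (i, j) in {edge + rowsPerTask * b .. #rowsPerTask, C}
    {
        _cb[j] = max(_cb[j], abs(a[i, j] * r[i]));
    }
}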

Summarising performance:

If a serial implementation using reductions in the innermost loop is treated as the baseline, then

  • parallelizing the outer loop with 24 cores without any consideration of cache misses, the naive solution, shrinks performance to about 40% of the baseline

  • parallelizing the outer loop with 24 cores but blocking the inner loop, the smarter solution as noted by @stonea, pushes performance to about a four-fold gain over the baseline, or 10 times better than the naive approach.

So, throwing 24 cores at the problem buys a 4-fold gain (roughly 17% parallel efficiency). Hmmm. It is a tough problem to parallelize.

Luckily, the whole concept of row and column equilibration to improve the accuracy of the solution of linear equations is only a small part of the overall problem. So it is not a critical part of the overall task, but it is an interesting problem that serves to highlight parallel programming techniques with Chapel in linear algebra, and it is a useful example for those learning the language and its constructs.