Hi Damian,
Partial reductions aren't something we currently have language features for, though they're certainly something that would be nice to add. Digging through GitHub, we do have a couple of issues about partial reductions:
So, partial reductions are on our radar, but they're not something we've heavily discussed or worked on yet. If this is a major annoyance/hindrance to you, or it's something that shows up often in your codes, do let us know so we can use that feedback when we set priorities.
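To make sure we're talking about the same thing: by a partial reduction I mean reducing away only some dimensions of an array, e.g. collapsing a 2D array into a 1D array of per-column maxima. Today that has to be written out element-wise, along these lines (this mirrors your example; 'colMax' is just a name I'm using for illustration):

var a : [0..<n, 0..<n] real;
var colMax : [0..<n] real;
// reduce away the row dimension, keeping one result per column
forall j in 0..<n with (ref colMax) do
  colMax[j] = max reduce a[.., j];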
Back to the topic: thinking about how to do a 2D column-wise partial reduction in parallel, I imagine a cache-friendly version would do something akin to this (distributing blocks of rows of 'a' among tasks, computing an intermediate result of the reduction for each block, and then reducing the intermediate results):
// intermediate results: one row per task
// (note: c_local's elements default to 0.0, so this sketch assumes the
// values in 'a' are non-negative; it also assumes n divides evenly by
// here.maxTaskPar, otherwise the leftover rows would be dropped)
var c_local : [0..<here.maxTaskPar, 0..<n] real;
const rowsPerTask = n / here.maxTaskPar;

coforall blk in 0..<here.maxTaskPar with (ref c_local) {
  // each task reduces its block of rows into its own row of c_local
  for i in blk*rowsPerTask..#rowsPerTask {
    for j in 0..<n do
      c_local[blk, j] = max(c_local[blk, j], a[i, j]);
  }
}

// combine the per-task intermediate results
for j in 0..<n do
  c[j] = max reduce c_local[.., j];
Though that's certainly not the most elegant Chapel code, and I imagine things would get hairier if we tried to generalize it or distribute it across nodes (so having some higher-level construct would be nice).
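To give a sense of the hairiness, here's an untested, back-of-the-envelope sketch of what a multi-locale variant might look like with the rows of 'a' block-distributed. I'm using the newer blockDist API from the BlockDist module, and the details here are assumptions rather than a vetted implementation:

use BlockDist;

config const n = 10_000;

// block-distribute 'a' so each locale owns a chunk of rows
const D = blockDist.createDomain({0..<n, 0..<n});
var a : [D] real;
var c : [0..<n] real;

// one row of intermediate results per locale
// (same zero-init caveat as above; also, this array lives on locale 0,
// so the updates below communicate -- a real version would keep the
// intermediates local and split the work among tasks on each locale too)
var c_local : [0..<numLocales, 0..<n] real;

coforall loc in Locales with (ref c_local) do on loc {
  // reduce the rows this locale owns into its own row of c_local
  for i in D.localSubdomain().dim(0) do
    for j in 0..<n do
      c_local[loc.id, j] = max(c_local[loc.id, j], a[i, j]);
}

// combine the per-locale intermediate results
for j in 0..<n do
  c[j] = max reduce c_local[.., j];

That's a lot of ceremony for what's conceptually a single reduce expression, which is the argument for a higher-level construct.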
For fun, I spent a bit of time playing with your example and looking at cache behavior.
use Time;

config param n = 45000;
config param w = 64;
config const alg = 0;

// a and r are left at their default value of 0.0, which is fine for
// timing purposes
var a : [0..<n, 0..<n] real(w);
var r : [0..<n] real(w);
var c : [0..<n] real(w);

var watch : stopwatch;
watch.start();

if alg == 0 {
  writeln("--- parallel, bad cache behavior ---");
  forall j in 0..<n with (ref c) do c[j] = max reduce abs(a[.., j] * r[..]);
} else if alg == 1 {
  writeln("--- serial, bad cache behavior ---");
  for j in 0..<n do c[j] = max reduce abs(a[.., j] * r[..]);
} else if alg == 2 {
  writeln("--- serial, good cache behavior ---");
  c = 0.0;
  for i in 0..<n {
    const ri = r[i];
    for j in 0..<n do c[j] = max(c[j], abs(a[i, j] * ri));
  }
} else if alg == 3 {
  writeln("--- parallel, good cache behavior ---");
  var c_local : [0..<here.maxTaskPar, 0..<n] real(w);
  const rowsPerTask = n / here.maxTaskPar;
  coforall blk in 0..<here.maxTaskPar with (ref c_local) {
    for i in blk*rowsPerTask..#rowsPerTask {
      const ri = r[i];
      for j in 0..<n do c_local[blk, j] = max(c_local[blk, j], abs(a[i, j] * ri));
    }
  }
  for j in 0..<n do
    c[j] = max reduce c_local[.., j];
}

watch.stop();
writeln("Elapsed time: ", watch.elapsed());
And I see these results on my 6-core Intel desktop when I run the program through perf to count cache misses:
--- parallel, bad cache behavior ---
Elapsed time: 15.6183
5,952,579,266 cache-misses:u
--- serial, bad cache behavior ---
Elapsed time: 4.45748
1,525,807,220 cache-misses:u
--- serial, good cache behavior ---
Elapsed time: 0.987536
244,644,652 cache-misses:u
--- parallel, good cache behavior ---
Elapsed time: 0.513626
253,860,618 cache-misses:u
(Here "serial, bad cache behavior" takes your original example and uses a 'for' loop instead of a 'forall', and "parallel, good cache behavior" is my suggested inelegant solution.) As expected, the "good cache behavior" versions run faster and have fewer cache misses. Also interesting, though I don't have a good explanation for either observation: the "serial, bad cache behavior" version has far fewer misses than its parallel counterpart, and the "parallel, good cache behavior" version doesn't see much benefit as I scale the number of cores (it gets a similar speedup if I set maxTaskPar to 2).
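(For the maxTaskPar=2 experiment, one way to cap the task count is the runtime's thread limit, e.g.:

CHPL_RT_NUM_THREADS_PER_LOCALE=2 ./partial --alg=3

since, to my recollection, here.maxTaskPar follows the number of runtime threads.)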