Hi Damian —

Sorry for the delay getting back to this thread. Here's how I'd propose you write this routine in Chapel as it stands today, with a reasonable combination of safety, elegance, and performance:

```
proc fudge(h : ?R, const ref x : [] R, const ref z : [?zD] R) {
  const (rows, columns) = zD.dims();
  var y : [columns] R;
  forall (r, c) in zD with (+ reduce y) do
    y[c] += x[r] * z[r, c];
  y /= h;
  return y;
}
```

The approach here is roughly:

- use a `forall` so that the 2D array can be iterated over in parallel and in the way that the array's author thinks is best (e.g., cache-friendly)
- use a `with`-clause on the `forall` to give each task its own copy of the result vector `y`, so that (a) there are no race conditions due to trying to update a shared vector in parallel, and (b) there's no battling over the cache line that a single shared vector would live on
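
In case it helps build intuition for that `with`-clause, here's a tiny standalone sketch of how an array `+ reduce` intent behaves (using a made-up histogram example, not anything from your code): each task accumulates into its own private copy of the array, and those copies are summed back into the original as the tasks complete:

```chapel
var hist: [0..9] int;

// each task gets a private, zero-initialized copy of 'hist';
// the per-task copies are combined elementwise with '+' as
// the tasks complete, with no races on the shared array
forall i in 1..1000 with (+ reduce hist) do
  hist[i % 10] += 1;

writeln(hist);  // each bucket should end up with 100
```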

I think this solution is also attractive in that it's very similar to what I expect we will write ourselves for this case once we've actually finished adding our partial reduction support.

I arrived at this solution by writing a number of variations, some legal, some intentionally illegal, and timing them:

```
use Time;

// Different ways of writing this loop
enum approaches {ser, naiveForall, naiveForallExpr, taskPrivPar, atomicPar};
use approaches;

// matrix size and what optional operations to do
config const m = 3, n = 5;
config const printArrays = false, verify = true;

// define our matrix and vector
var A: [1..m, 1..n] int;
var v: [1..m] int;

// fill them with silly, but unique, values
forall (i, j) in A.domain do
  A[i, j] = i*1000000 + j;
for i in 1..m do
  v[i] = i;

// create a serial reference result vector
var r = fudge(1, v, A, ser);
if printArrays then
  writeln("serial result is:\n", r);

// test the other approaches
testApproach(naiveForall);
testApproach(naiveForallExpr);
testApproach(taskPrivPar);
testApproach(atomicPar);

// to test an approach, we'll time it, optionally print the array,
// and then make sure it matches our reference result
proc testApproach(param approach) {
  var t: Timer;
  t.start();
  var s = fudge(1, v, A, approach);
  t.stop();
  writeln("approach ", approach, " took: ", t.elapsed());
  if printArrays then
    writeln("parallel result using ", approach, " is:\n", s);
  if verify {
    forall i in r.domain do
      if r[i] != s[i] then
        writeln("Mismatch at element ", i, " using ", approach, ": ",
                r[i], " != ", s[i]);
  } else {
    if approach == naiveForall || approach == naiveForallExpr then
      writeln("NOTE: ", approach, " isn't guaranteed to give correct results");
  }
}

// Here's the implementation of the five approaches
proc fudge(h : ?R, const ref x : [] R, const ref z : [?zD] R, param approach) {
  const (rows, columns) = zD.dims();
  var y : [columns] R;

  select approach {
    // serial: this is what you sent
    when ser {
      for (r, c) in zD do
        y[c] += x[r] * z[r, c];
    }

    // naive forall loop: this isn't actually correct in general
    // because if the 2D iteration space is chunked up by rows
    // (which is what Chapel typically prefers), multiple tasks
    // may try to update the same column location of 'y'
    // simultaneously.  So why did I write it?  To see how much
    // faster it would go...
    when naiveForall {
      forall (r, c) in zD do
        y[c] += x[r] * z[r, c];
    }

    // This is essentially the same as the previous, simply written
    // in shorthand.  Like the previous, it isn't actually correct
    // in general.
    when naiveForallExpr {
      [(r, c) in zD] y[c] += x[r] * z[r, c];
    }

    // This is the solution I picked
    when taskPrivPar {
      forall (r, c) in zD with (+ reduce y) do
        y[c] += x[r] * z[r, c];
    }

    // This is a variation on the previous in which we use a shared
    // atomic result vector rather than a shared race-y array or a
    // private per-task vector.  The downside is that each '+'
    // operation becomes a more expensive atomic '+', along with the
    // cache downsides of sharing a result vector.
    when atomicPar {
      var yAt: [columns] atomic R;
      forall (r, c) in zD do
        yAt[c].add(x[r] * z[r, c]);
      y = yAt.read();
    }

    otherwise {
      compilerError("Unsupported algorithm");
    }
  }

  // scale the result vector
  y /= h;

  // return the result vector
  return y;
}
```

What was most intriguing about this experiment was that I expected the two illegal versions to be the fastest (because of their lack of synchronization), yet incorrect (for the same reason). However, on my laptop at least, they were not only wrong but also not the fastest (presumably because of the need for multiple tasks to battle over the shared vector in the cache?). For example:

```
$ ./testit --m=30000 --n=30000 --verify=false
approach naiveForall took: 0.478617
NOTE: naiveForall isn't guaranteed to give correct results
approach naiveForallExpr took: 0.513474
NOTE: naiveForallExpr isn't guaranteed to give correct results
approach taskPrivPar took: 0.448558
approach atomicPar took: 2.1744
```

One part of this conversation remains somewhat mysterious to me, though, even if I feel I have to trust that you know what you're doing. Specifically, when I think of linear algebra and matrix-vector multiplication, I imagine the computation in this slightly different form:

```
// x would be of the form: var x: [cols] R;
var y : [rows] R;
forall (r, c) in zD with (+ reduce y) do
  y[r] += z[r, c] * x[c];
```

to match the mathematical `y = Z*x`; yet your computation is more like the transpose of this. Is there a good intuition to get me out of this mental trough and on board with what you've written?

I mention this because writing the naive versions of my form results in code that is typically legal in Chapel by default, *as long as the number of columns exceeds the number of tasks (cores)*. The reason is that when we parallelize by chunking up the rows, each task ends up with a unique sub-vector of the solution to write to. This is also the case that motivates partial reductions from a performance perspective more than yours does, since the implementation of that partial reduction could check for the *typically* caveat above and fall back on the more expensive solution if need be.
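
For concreteness, the naive (intent-free) version of my form that I'd expect to typically be legal would look something like this sketch (reusing the `z`, `zD`, `x`, and `rows` names from your routine):

```chapel
// no reduce intent needed *if* the parallel iteration chunks zD by
// rows, since each task then writes a disjoint sub-vector of 'y';
// if a single row's columns were ever split across tasks, this
// would race
var y : [rows] R;
forall (r, c) in zD do
  y[r] += z[r, c] * x[c];
```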

Anyway, I hope this solution might be helpful and useful, at least until we get a chance to return to partial reductions.

Thanks,

-Brad