The declaration and initialization sequence

```
var e : [eD] real;
forall (i,j,k) in eD do e[i, j, k] = expression;
```

is way faster than the far more compact

```
const e = forall (i,j,k) in eD do expression;
```

Any ideas why?

The declaration and initialization sequence

```
var e : [eD] real;
forall (i,j,k) in eD do e[i, j, k] = expression;
```

is way faster than the far more compact

```
const e = forall (i,j,k) in eD do expression;
```

Any ideas why?

Damian, how big is your `eD`

? Asymptotically the two approaches should have identical performance.

While the first approach performs parallel iteration over a domain, the second approach does parallel zippered iteration over the resulting array and the iterable object for the forall expression. [Reference: it is the loop 'forall (r, src) in zip(result, ir)' in chpl__initCopy_shapeHelp() in modules/internal/ChapelArray.chpl]

So it is likely that the additional overhead in the second approach is observable for a smaller eD.

eD is {1..130, 1..130, 1..130},

It is heavily parallelized. Running on a 2*12 core system, the first takes 30 seconds. The second takes 76 seconds.

EDIT - The elapsed time consumed by the offending routing which does a

```
max reduce abs(e);
```

and nothing else, goes up by approximately a factor of 4.

EDIT - actually a factor of nearly 5 (which I investigate a bit more).

If you tell me how to attach a file to a Discourse post, I can upload the code.

It seems discourse has a very short list of allowed attachment file types, and .chpl isn't on it (nor is .txt). You could format the code as a .pdf or .png.

I think I read somewhere that maybe someone with chapel.discourse admin rights could add .chpl or .txt to the allowed list?

Another option if the code is too long to post inline might be to use a gist: https://gist.github.com/ or even create a github repo for it if it's even more complex.

See `calc_erroX`

below. These are tweaks of what exists in Anna Jesus's finite difference code as seen in the Discourse posting about "Code Optimization". For performance reasons, they return the worst scalar error, **NOT** the array of errors as Anna does. The one which initialises the array in the declaration has woeful performance. I got rid of the divisions from the original code.

```
proc calc_erroY(f, tf, dx2: real, dy2: real, dz2: real)
{
param one = 1.0;
const rdx2 = one / dx2;
const rdy2 = one / dy2;
const rdz2 = one / dz2;
const (_i, _j, _k) = f.domain.dims();
const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
var e : [innerIJK] real;
forall (i,j,k) in innerIJK do
{
const fijk = f[i,j,k], t = fijk + fijk;
e[i,j,k] = ((f[i,j,k-1] + f[i,j,k+1]) - t)*rdz2 - tf[i,j,k]
+ ((f[i,j-1,k] + f[i,j+1,k]) - t)*rdy2
+ ((f[i-1,j,k] + f[i+1,j,k]) - t)*rdx2;
}
return max reduce(abs(e));
}
proc calc_erroZ(f, tf, dx2: real, dy2: real, dz2: real)
{
param one = 1.0;
const rdx2 = one / dx2;
const rdy2 = one / dy2;
const rdz2 = one / dz2;
const rxyz = 2.0 * (rdx2 + rdy2 + rdz2);
const (_i, _j, _k) = f.domain.dims();
const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
var e : [innerIJK] real;
forall (i,j,k) in innerIJK do
{
const _f = f[i,j,k], t = _f + _f;
e[i,j,k] = tf[i,j,k] -
(f[i,j,k-1] - t + f[i,j,k+1])*rdz2 -
(f[i,j-1,k] - t + f[i,j+1,k])*rdy2 -
(f[i-1,j,k] - t + f[i+1,j,k])*rdx2;
}
return max reduce(abs(e));
}
proc calc_erroX(f, tf, dx2: real, dy2: real, dz2: real)
{
param one = 1.0;
const rdx2 = one / dx2;
const rdy2 = one / dy2;
const rdz2 = one / dz2;
const rxyz = 2.0 * (rdx2 + rdy2 + rdz2);
const (_i, _j, _k) = f.domain.dims();
const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
const e = forall (i,j,k) in innerIJK do
((f[i,j,k-1] + f[i,j,k+1]))*rdz2 - tf[i,j,k] +
((f[i,j-1,k] + f[i,j+1,k]))*rdy2 +
((f[i-1,j,k] + f[i+1,j,k]))*rdx2 - rxyz * f[i,j,k];
return max reduce(abs(e));
}
inline proc calc_erro(f, tf, dx2: real, dy2: real, dz2: real)
{
// enable or disable to taste
// return calc_erroY(f, tf, dx2, dy2, dz2); // 30 seconds
return calc_erroZ(f, tf, dx2, dy2, dz2); // 30 seconds
// return calc_erroX(f, tf, dx2, dy2, dz2); // 76 seconds
}
```

Damian,

Sorry for the uploading trouble. We will look into allowing uploads of Chapel code.

Nothing in your calc_erroXYZ jumps at me as a plausible cause of performance difference, and I do not have the time to dig deeper.

Given how nicely you organized the code, I am curious about the performance of a version that does not create the array `e`

altogether, instead reducing directly the values produced by the forall loop:

```
return max reduce abs(
forall (i,j,k) in innerIJK do
((f[i,j,k-1] + f[i,j,k+1]))*rdz2 - tf[i,j,k] +
((f[i,j-1,k] + f[i,j+1,k]))*rdy2 +
((f[i-1,j,k] + f[i+1,j,k]))*rdx2 - rxyz * f[i,j,k]
);
```

This should be enabled now.

-Brad