The declaration and initialization sequence

```
var e : [eD] real;
forall (i,j,k) in eD do e[i, j, k] = expression;
```

is way faster than the far more compact

```
const e = forall (i,j,k) in eD do expression;
```

Any ideas why?

The declaration and initialization sequence

```
var e : [eD] real;
forall (i,j,k) in eD do e[i, j, k] = expression;
```

is way faster than the far more compact

```
const e = forall (i,j,k) in eD do expression;
```

Any ideas why?

Damian, how big is your `eD`

? Asymptotically the two approaches should have identical performance.

While the first approach performs parallel iteration over a domain, the second approach does parallel zippered iteration over the resulting array and the iterable object for the forall expression. [Reference: it is the loop 'forall (r, src) in zip(result, ir)' in chpl__initCopy_shapeHelp() in modules/internal/ChapelArray.chpl]

So it is likely that the additional overhead in the second approach is observable for a smaller eD.

eD is {1..130, 1..130, 1..130},

It is heavily parallelized. Running on a 2*12 core system, the first takes 30 seconds. The second takes 76 seconds.

EDIT - The elapsed time consumed by the offending routing which does a

```
max reduce abs(e);
```

and nothing else, goes up by approximately a factor of 4.

EDIT - actually a factor of nearly 5 (which I investigate a bit more).

If you tell me how to attach a file to a Discourse post, I can upload the code.

It seems discourse has a very short list of allowed attachment file types, and .chpl isn't on it (nor is .txt). You could format the code as a .pdf or .png.

I think I read somewhere that maybe someone with chapel.discourse admin rights could add .chpl or .txt to the allowed list?

Another option if the code is too long to post inline might be to use a gist: https://gist.github.com/ or even create a github repo for it if it's even more complex.

See `calc_erroX`

below. These are tweaks of what exists in Anna Jesus's finite difference code as seen in the Discourse posting about "Code Optimization". For performance reasons, they return the worst scalar error, **NOT** the array of errors as Anna does. The one which initialises the array in the declaration has woeful performance. I got rid of the divisions from the original code.

```
proc calc_erroY(f, tf, dx2: real, dy2: real, dz2: real)
{
param one = 1.0;
const rdx2 = one / dx2;
const rdy2 = one / dy2;
const rdz2 = one / dz2;
const (_i, _j, _k) = f.domain.dims();
const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
var e : [innerIJK] real;
forall (i,j,k) in innerIJK do
{
const fijk = f[i,j,k], t = fijk + fijk;
e[i,j,k] = ((f[i,j,k-1] + f[i,j,k+1]) - t)*rdz2 - tf[i,j,k]
+ ((f[i,j-1,k] + f[i,j+1,k]) - t)*rdy2
+ ((f[i-1,j,k] + f[i+1,j,k]) - t)*rdx2;
}
return max reduce(abs(e));
}
proc calc_erroZ(f, tf, dx2: real, dy2: real, dz2: real)
{
param one = 1.0;
const rdx2 = one / dx2;
const rdy2 = one / dy2;
const rdz2 = one / dz2;
const rxyz = 2.0 * (rdx2 + rdy2 + rdz2);
const (_i, _j, _k) = f.domain.dims();
const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
var e : [innerIJK] real;
forall (i,j,k) in innerIJK do
{
const _f = f[i,j,k], t = _f + _f;
e[i,j,k] = tf[i,j,k] -
(f[i,j,k-1] - t + f[i,j,k+1])*rdz2 -
(f[i,j-1,k] - t + f[i,j+1,k])*rdy2 -
(f[i-1,j,k] - t + f[i+1,j,k])*rdx2;
}
return max reduce(abs(e));
}
proc calc_erroX(f, tf, dx2: real, dy2: real, dz2: real)
{
param one = 1.0;
const rdx2 = one / dx2;
const rdy2 = one / dy2;
const rdz2 = one / dz2;
const rxyz = 2.0 * (rdx2 + rdy2 + rdz2);
const (_i, _j, _k) = f.domain.dims();
const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
const e = forall (i,j,k) in innerIJK do
((f[i,j,k-1] + f[i,j,k+1]))*rdz2 - tf[i,j,k] +
((f[i,j-1,k] + f[i,j+1,k]))*rdy2 +
((f[i-1,j,k] + f[i+1,j,k]))*rdx2 - rxyz * f[i,j,k];
return max reduce(abs(e));
}
inline proc calc_erro(f, tf, dx2: real, dy2: real, dz2: real)
{
// enable or disable to taste
// return calc_erroY(f, tf, dx2, dy2, dz2); // 30 seconds
return calc_erroZ(f, tf, dx2, dy2, dz2); // 30 seconds
// return calc_erroX(f, tf, dx2, dy2, dz2); // 76 seconds
}
```

Damian,

Sorry for the uploading trouble. We will look into allowing uploads of Chapel code.

Nothing in your calc_erroXYZ jumps at me as a plausible cause of performance difference, and I do not have the time to dig deeper.

Given how nicely you organized the code, I am curious about the performance of a version that does not create the array `e`

altogether, instead reducing directly the values produced by the forall loop:

```
return max reduce abs(
forall (i,j,k) in innerIJK do
((f[i,j,k-1] + f[i,j,k+1]))*rdz2 - tf[i,j,k] +
((f[i,j-1,k] + f[i,j+1,k]))*rdy2 +
((f[i-1,j,k] + f[i+1,j,k]))*rdx2 - rxyz * f[i,j,k]
);
```

This should be enabled now.

-Brad

I'm coming very late to the meat of this topic after a December in which I couldn't keep up and am wondering if the original issue Damian asks about could be explained by:

If the:

```
const e = forall (i,j,k) in eD do expression;
```

expression resulted in a zippered loop over `e`

and `eD`

then it might. But since this is an initialization context, I'm not sure whether that will happen, or whether the RHS will create a virtual array and then `e`

will steal it. The expertise on that question lies somewhere between @vass and @mppf.

At the same time, it's taken me so long to catch up with this question that it may be moot or not of concern anymore. If it is (even a little bit), I think it'd be interesting to capture the two code idioms and some timings on a github issue.

-Brad

Thanks for the insight. Yes it is still an issue. Not sure I have totally wrapped my head around the detail of #13147. Should we move this discussion into that issue or should are you suggest that we create a new issue? It needs some input from someone with more understanding of the internals than I have.

When initializing an array from a loop expression / iterator, indeed it ends up using a zippered loop. I think the relevant code for Damian's case is here: https://github.com/chapel-lang/chapel/blob/main/modules/internal/ChapelArray.chpl#L2635 but that should be confirmed before we rely on it.

I was suggesting moving your observations into a new issue and then if we validate that it's an instance of #13147, we can have them refer to one another.

The gist of #13147 is that zippered iteration over multidimensional domains or arrays has known performance problems at present. Iteration over a single multidimensional domain or array does not share this problem.

-Brad

So is

```
Slow zippered iteration over a multidimensional domain
```

a sufficiently concise title for the new issue?

No, I think that's essentially what #13147 already says. For your issue I'd focus on the observation you're making rather than the cause (or presumed cause). So, for example, maybe something like "Multidimensional array initialization is much slower than multidimensional array assignment"

-Brad

Spot on. That will teach me to **not** hit reply before my early morning coffee..

Thanks