Array initialization by expression

damianmoz · December 11, 2022, 1:30pm

The declaration and initialization sequence

var e : [eD] real;

forall (i,j,k) in eD do e[i, j, k] = expression;

is way faster than the far more compact

const e = forall (i,j,k) in eD do expression;

Any ideas why?

vass · December 13, 2022, 2:40am

Damian, how big is your eD ? Asymptotically the two approaches should have identical performance.

While the first approach performs parallel iteration over a domain, the second approach does parallel zippered iteration over the resulting array and the iterable object for the forall expression. [Reference: it is the loop 'forall (r, src) in zip(result, ir)' in chpl__initCopy_shapeHelp() in modules/internal/ChapelArray.chpl]

So it is likely that the additional overhead in the second approach is observable for a smaller eD.

damianmoz · December 13, 2022, 2:04pm

eD is {1..130, 1..130, 1..130},

It is heavily parallelized. Running on a 2*12 core system, the first takes 30 seconds. The second takes 76 seconds.

EDIT - The elapsed time consumed by the offending routing which does a

max reduce abs(e);

and nothing else, goes up by approximately a factor of 4.

EDIT - actually a factor of nearly 5 (which I investigate a bit more).

If you tell me how to attach a file to a Discourse post, I can upload the code.

cassella · December 21, 2022, 11:53pm

It seems discourse has a very short list of allowed attachment file types, and .chpl isn't on it (nor is .txt). You could format the code as a .pdf or .png.

I think I read somewhere that maybe someone with chapel.discourse admin rights could add .chpl or .txt to the allowed list?

Another option if the code is too long to post inline might be to use a gist: https://gist.github.com/ or even create a github repo for it if it's even more complex.

damianmoz · December 22, 2022, 12:30am

See calc_erroX below. These are tweaks of what exists in Anna Jesus's finite difference code as seen in the Discourse posting about "Code Optimization". For performance reasons, they return the worst scalar error, NOT the array of errors as Anna does. The one which initialises the array in the declaration has woeful performance. I got rid of the divisions from the original code.

    proc calc_erroY(f, tf,  dx2: real, dy2: real, dz2: real)
    {
        param one = 1.0;
        const rdx2 = one / dx2;
        const rdy2 = one / dy2;
        const rdz2 = one / dz2;
        const (_i, _j, _k) = f.domain.dims();
        const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
        var e : [innerIJK] real;

        forall (i,j,k) in innerIJK do
        {
            const fijk = f[i,j,k], t = fijk + fijk;

            e[i,j,k] = ((f[i,j,k-1] + f[i,j,k+1]) - t)*rdz2 - tf[i,j,k]
                + ((f[i,j-1,k] + f[i,j+1,k]) - t)*rdy2
                + ((f[i-1,j,k] + f[i+1,j,k]) - t)*rdx2;
        }
        return max reduce(abs(e));
    }
    proc calc_erroZ(f, tf,  dx2: real, dy2: real, dz2: real)
    {
        param one = 1.0;
        const rdx2 = one / dx2;
        const rdy2 = one / dy2;
        const rdz2 = one / dz2;
        const rxyz = 2.0 * (rdx2 + rdy2 + rdz2);
        const (_i, _j, _k) = f.domain.dims();
        const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
        var e : [innerIJK] real;

        forall (i,j,k) in innerIJK do
        {
            const _f = f[i,j,k], t = _f + _f;

            e[i,j,k] = tf[i,j,k] -
                (f[i,j,k-1] - t + f[i,j,k+1])*rdz2 -
                (f[i,j-1,k] - t + f[i,j+1,k])*rdy2 -
                (f[i-1,j,k] - t + f[i+1,j,k])*rdx2;
        }
        return max reduce(abs(e));
    }
    proc calc_erroX(f, tf,  dx2: real, dy2: real, dz2: real)
    {
        param one = 1.0;
        const rdx2 = one / dx2;
        const rdy2 = one / dy2;
        const rdz2 = one / dz2;
        const rxyz = 2.0 * (rdx2 + rdy2 + rdz2);
        const (_i, _j, _k) = f.domain.dims();
        const innerIJK = {_i.expand(-1), _j.expand(-1), _k.expand(-1)};
        const e = forall (i,j,k) in innerIJK do
            ((f[i,j,k-1] + f[i,j,k+1]))*rdz2 - tf[i,j,k] +
            ((f[i,j-1,k] + f[i,j+1,k]))*rdy2 +
            ((f[i-1,j,k] + f[i+1,j,k]))*rdx2 - rxyz * f[i,j,k];

        return max reduce(abs(e));
    }
    inline proc calc_erro(f, tf,  dx2: real, dy2: real, dz2: real)
    {
        // enable or disable to taste
        // return calc_erroY(f, tf,  dx2, dy2, dz2); // 30 seconds
        return calc_erroZ(f, tf,  dx2, dy2, dz2); // 30 seconds
        // return calc_erroX(f, tf,  dx2, dy2, dz2); // 76 seconds
    }

vass · December 23, 2022, 2:53am

Damian,

Sorry for the uploading trouble. We will look into allowing uploads of Chapel code.

Nothing in your calc_erroXYZ jumps at me as a plausible cause of performance difference, and I do not have the time to dig deeper.

Given how nicely you organized the code, I am curious about the performance of a version that does not create the array e altogether, instead reducing directly the values produced by the forall loop:

    return max reduce abs(
      forall (i,j,k) in innerIJK do
        ((f[i,j,k-1] + f[i,j,k+1]))*rdz2 - tf[i,j,k] +
        ((f[i,j-1,k] + f[i,j+1,k]))*rdy2 +
        ((f[i-1,j,k] + f[i+1,j,k]))*rdx2 - rxyz * f[i,j,k]
    );

bradcray · January 3, 2023, 9:27pm

This should be enabled now.

-Brad

cassella · January 4, 2023, 12:27am

checking.chpl (42 Bytes)

vass · January 4, 2023, 1:03am

Thanks for checking @cassella. It works -- I can download the .chpl file that you uploaded.

bradcray · March 11, 2023, 2:11am

I'm coming very late to the meat of this topic after a December in which I couldn't keep up and am wondering if the original issue Damian asks about could be explained by:

github.com/chapel-lang/chapel

Multidimensional zippered iteration (or promotion) kills performance

opened 10:50PM - 31 May 19 UTC

bradcray

type: Design area: Compiler type: Performance area: Language

[This is a longstanding issue that I'm capturing on GitHub since I couldn't find… it here] When iterating over multiple multidimensional arrays using zippered iteration, performance currently falls off a steep cliff. For example: ```chapel var A, B: [1..n, 1..n] real; forall (a, b) in zip(A, B) do a = 2*b; ``` This can also happen when calling a scalar function in a promoted manner with multidimensional arrays, since it is equivalent to a zippered iteration: ```chapel foo(A, B); proc foo(x: real, y: real) { ... } ``` The reason for this is that the compiler currently doesn't know how to combine the parallel iterators for multidimensional arrays into a single compact loop nest, so it falls back to the catch-all implementation of creating an iterator object for each array and iterating over those objects in a zippered manner. While this result in a correct and very general-purpose implementation for very arbitrary array iterators, it is also very expensive. Until this situation is improved, users can often work around the issue for cases that share the same domain (like above) by looping over the domain and indexing into the arrays to avoid the zippered iteration: ```chapel forall ind in A.domain do A[ind] = 2*B[ind]; ``` or: ```chapel forall ind in A.domain do foo(A[ind], B[ind]); ```

If the:

const e = forall (i,j,k) in eD do expression;

expression resulted in a zippered loop over e and eD then it might. But since this is an initialization context, I'm not sure whether that will happen, or whether the RHS will create a virtual array and then e will steal it. The expertise on that question lies somewhere between @vass and @mppf.

At the same time, it's taken me so long to catch up with this question that it may be moot or not of concern anymore. If it is (even a little bit), I think it'd be interesting to capture the two code idioms and some timings on a github issue.

-Brad

damianmoz · March 12, 2023, 8:24pm

Thanks for the insight. Yes it is still an issue. Not sure I have totally wrapped my head around the detail of #13147. Should we move this discussion into that issue or should are you suggest that we create a new issue? It needs some input from someone with more understanding of the internals than I have.

mppf · March 13, 2023, 1:30pm

When initializing an array from a loop expression / iterator, indeed it ends up using a zippered loop. I think the relevant code for Damian's case is here: https://github.com/chapel-lang/chapel/blob/main/modules/internal/ChapelArray.chpl#L2635 but that should be confirmed before we rely on it.

bradcray · March 13, 2023, 5:41pm

I was suggesting moving your observations into a new issue and then if we validate that it's an instance of #13147, we can have them refer to one another.

The gist of #13147 is that zippered iteration over multidimensional domains or arrays has known performance problems at present. Iteration over a single multidimensional domain or array does not share this problem.

-Brad

damianmoz · March 13, 2023, 9:27pm

So is

Slow zippered iteration over a multidimensional domain

a sufficiently concise title for the new issue?

bradcray · March 13, 2023, 9:33pm

No, I think that's essentially what #13147 already says. For your issue I'd focus on the observation you're making rather than the cause (or presumed cause). So, for example, maybe something like "Multidimensional array initialization is much slower than multidimensional array assignment"

-Brad

damianmoz · March 13, 2023, 9:50pm

Spot on. That will teach me to not hit reply before my early morning coffee..

Thanks

Topic		Replies	Views
I love chapel's set array operations, but the edge cases are really hard Users	7	339	June 22, 2022
Array assignment Users	3	173	January 16, 2024
Reindexing to same array Users	12	501	April 21, 2021
Partial reductions (operate on 2D-array and produce 1-D array) Users	19	495	December 31, 2022
Vectorized and parallell loops. Can I have both? Developers	8	487	June 6, 2022

Array initialization by expression

Related topics