Weird performance slow-down

Given the following matrix multiplication routine

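// Note: R is the element type and is assumed to be declared elsewhere in
// the module, e.g. "type R = real;".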
proc mmx(y : [?yD] R, const p : [?pD] R, const q : [?qD] R)
{
    param zero = 0:R;
    const (rows, inner) = pD.dims();
    const (_i, columns) = qD.dims();
    const tail = inner.size & 3;
    const lo = inner.low;
    const most = lo + tail .. inner.high by 4;
    const rest = lo .. lo + tail - 1;

    for c in columns do
    {
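        // stage column c of q in a contiguous 1D buffer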
        var b : [inner] R;

        foreach i in inner do
        {
            b[i] = q[i, c];
        }
        for r in rows do
        {
            // const ref pr = p[r, inner]; // enable this to see bug
            var s = zero;

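            // handle the first 'tail' elements (inner.size mod 4) that
            // don't fit the 4-way unrolled loop below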
            for i in rest do
            {
                s += b[i] * p[r, i];
            }
            foreach i in most do // unroll loop
            {
                s +=
                    p[r, i + 0] * b[i + 0] + p[r, i + 1] * b[i + 1] +
                    p[r, i + 2] * b[i + 2] + p[r, i + 3] * b[i + 3] ;
            }
            y[r, c] += s;
        }
    }
}

If you enable the line marked "enable this to see bug", which technically does nothing, the performance dies, dropping by roughly a factor of 2!

Compiler is 1.29.0 with

--fast --ieee-float
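i.e., a compile line along these lines (the source file name here is just a placeholder):

chpl --fast --ieee-float mmx.chpl -o mmx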

Thoughts anybody???

Hi Damian,

That line creates a rank-change view operation on the RHS, which has some cost. A workaround could be to keep the view 2D by using a slice view, p[r..r, inner], which also isn't free but is probably cheaper than the rank-change operation. It will also change the way you use pr, since you now have a 2D array whose only valid index in the first dimension is r.
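Concretely, the inner loops of your routine might then look something like this (an untested sketch; everything not shown stays as in your original mmx):

for r in rows do
{
    // rank-preserving 2D slice view of row r (its first dimension is just r..r)
    const ref pr = p[r..r, inner];
    var s = zero;

    for i in rest do
    {
        s += b[i] * pr[r, i];   // note the 2D index; the first index is fixed at r
    }
    foreach i in most do // unroll loop
    {
        s +=
            pr[r, i + 0] * b[i + 0] + pr[r, i + 1] * b[i + 1] +
            pr[r, i + 2] * b[i + 2] + pr[r, i + 3] * b[i + 3] ;
    }
    y[r, c] += s;
}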

The compiler could ideally remove that line, since it has no effect, but given the array implementation and the complications of the rank-change operation, it is arguably significantly harder for the compiler to prove that the operation doesn't have any side effects. As such, it hasn't been a priority in our compiler/performance efforts.

Engin


Yes, it is a rank change. But such a slice operation has been around in languages for over half a century, since long before any of us were using computers. So it really needs to be put on the list for optimization, even if it is not at the top of the list, because I realize that you guys are busy. It is a fundamental operation in matrix algebra.

However, there are two points about my code. Firstly, the ref is never used, and yet it slows down the multiplication of two 800*800 matrices by a factor of two. That is scary.

Also, and I could be wrong, but I would have thought that if I create a reference to an entire row prior to some loop, then surely accesses to its elements during that loop through that ref would be heaps faster than the alternative of accessing the individual elements within the original 2D array.

Tell me if I am pushing for something that really is not top of the priority list. I have just realized that some of the issues I had with optimizing the SOR Poisson equation were related to this. Also, it seems like the performance of using a ref has actually gotten worse since 1.22.0 but that may be a total misconception as I have not tested this (and do not have the time to go back and test). Or maybe other things have gotten faster and that has been left alone.

Thanks for listening.

Hi Damian —

There's no need to punctuate points that you want to make with an underhanded (or blatant) insult. You asked a very broad question, and Engin answered it. You're obviously not happy with the status quo or the implications of the answer, but that's no reason to react insultingly.

We're well aware that slicing has a long history, but Chapel has a feature-rich, aggressive language design that doesn't have the luxury of being implemented in terms of those previous solutions, so we must re-invent them. Moreover, because of differences in our needs and design goals (e.g., supporting distributed arrays, permitting users to create their own array implementations and distributions), it's hard to imagine how we would leverage that prior art (i.e., those half-century-old solutions don't support user-defined distributed arrays).

Rank-change has not been a priority to date because uses of it have been fairly minimal in user codes so far, and rank-preserving slices have typically sufficed. That said, rank-change has started to see more use recently, leading us to open issues like Optimize rank change slices to support a lazy creation mode, similar to rank-preserving slices · Issue #20749 · chapel-lang/chapel · GitHub. You are of course also welcome to file issues of your own requesting improvements in performance or changes in behavior.

You note:

the ref is never used. And yet it slows down the matrix multiplication of two 800*800 matrices down by a factor of two. That is scary.

Engin addressed this in his response:

The compiler could ideally remove that line, since it has no effect, but given the array implementation and the complications of the rank-change operation, it is arguably significantly harder for the compiler to prove that the operation doesn't have any side effects. As such, it hasn't been a priority in our compiler/performance efforts.

So this point was obviously not lost on him. Tagging onto his response, I'll add that we generally haven't expended effort in our compiler optimizing away dead code that a user writes but didn't actually want to execute. There are a number of reasons for this, some smaller, some bigger:

  • we've arguably got bigger fish to fry at this point (e.g., I'd argue that optimizing rank-change slices feels more important than eliminating unused ones)
  • users can typically delete code that they don't want to execute rather than relying on the compiler to eliminate it for them
  • it's been useful for benchmarking purposes
  • we can often rely on the back-end compiler to take care of dead-code elimination for scalar operations (unfortunately, the code we generate for an array slice is complicated enough that we can't expect it to take care of such cases for us; see the sketch after this list)
  • due to the complications within our compiler that Engin cites
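As a rough sketch of that back-end contrast (the declarations here are made-up examples, not code from the routine above):

var p : [1..8, 1..8] real;
const r = 1;

// A dead scalar computation like this shows up as plain arithmetic in the
// generated code, so the back-end compiler can typically eliminate it.
var unusedScalar = p[r, 1] + p[r, 2];

// An unused rank-change slice, by contrast, is generated as calls that set
// up an array view, which the back end can't easily prove to be free of
// side effects, so it is not eliminated.
const ref unusedView = p[r, 1..8];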

I would have thought that if I create a reference to an entire row prior to some loop, then surely accesses to its elements during that loop through that ref would be heaps faster than the alternative of accessing the individual elements within the original 2D array.

In our current implementation, this isn't the case. Rank-change slicing is currently implemented by converting the 1D array access of the slice into a 2D access on the original array. This approach is taken primarily because of the need to support a rank-change slice of any kind of array—a local array, a block-distributed array, a cyclic-distributed array, the array implementation that you might write, etc. This is why Chapel can't generally use a simple dope-vector manipulation as in F90. (Could Chapel optimize rank-change slices of local arrays using dope-vector-like techniques? Quite likely, but it doesn't today.)
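As a small, self-contained sketch of what that means at the user level (the sizes and values here are purely illustrative):

type R = real;                   // assumed element type, for illustration
const D = {1..3, 1..5};
var p : [D] R;

const r = 2;
const (_i, inner) = D.dims();    // inner is the second dimension's range

// rank-change slice: a 1D view of row r of the 2D array p
ref row = p[r, inner];

// As described above, an access like row[i] is currently implemented by
// converting it into the 2D access p[r, i] on the original array rather
// than a plain 1D access, which is where the extra cost comes from.
for i in inner do
{
    row[i] = i : R;              // a write through the 1D view...
    assert(p[r, i] == i : R);    // ...lands in row r of the original array
}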

it seems like the performance of using a ref has actually gotten worse since 1.22.0 but that may be a total misconception as I have not tested this (and do not have the time to go back and test).

I'm not certain, but I'd be surprised if that were the case—neither refs nor rank-change slices have undergone significant changes since 1.22 that I can think of. I also don't think it's the ref that's resulting in the additional performance overhead here, but the rank-change slice itself and the indexing into it.

Tell me if I am pushing for something that really is not top of the priority list.

It definitely isn't at present. Current priorities are listed here:

-Brad