19350, "bradcray", "spectral-norm shootout, next steps", "2022-03-04T00:01:13Z"
While looking at spectral norm recently and getting ramped back up on it, I wanted to note possible next steps, starting from something resembling the version in [Spectral-norm: Change / 2 into * 0.5 to see the impact on nightly testing (PR #19349)](https://github.com/chapel-lang/chapel/pull/19349) (which we expect to submit to the CLBG site soon):
- Unrolling the main loops by 2 and changing the arguments of `A()` from `int` to `real` resulted in nearly a 2x speedup on my Mac and chapcs07, presumably due to vectorization. That said, I have not done any investigation of the generated code to understand what is being done or why. That would be good to understand better (a rough sketch of the formulations in question appears after this list). Specifically:
  - is there a way we could get similar performance from a `foreach` without unrolling by 2?
  - we can't use a foreach directly in the reduction (but would like to, I think), due to #19336
  - I've tried using a manual foreach and accumulation into a scalar, but didn't get good performance unless I unrolled it by 2 again
  - as a programmer, it would never occur to me to change the arguments of `A()` from `int` to `real` to improve performance because I think of reals as being strictly more expensive. And I think the compiler could never do such a transformation since `int`->`real` can lose information. Yet is there something we could do to help users get the benefits that this change provided?
  - unrolling by 2 also seems likely to be specific to a particular processor / vector architecture, which is unfortunate; how could we express this more cleanly and without presuming a specific implementation?
- The `tmp` vector declared in `main()` really doesn't have any purpose outside of `multiplyAtAv()` other than to avoid allocations and de-allocations. It would be nice to make it static to `multiplyAtAv()` for this reason, as proposed in #12281 (see the second sketch below).
- If/when we have partial reductions, using them for the `multiplyA[t]v()` routines would be natural and would clean up the code nicely (third sketch below). Of course, doing so would also make it more challenging to manually unroll by 2, which suggests it would want a more satisfying approach as indicated in bullet 1 above.
- I don't know that anybody has done a careful study of our fastest version vs. the fastest ones on the site currently, so that's always of interest as well: Is there a place where we're spending too much time?
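To make the first bullet concrete, here's a rough sketch of the two inner-loop formulations being compared. This is not the benchmark source: `rowDot` and `rowDotUnrolled2` are made-up names, and the `A()` shown is the `real`-argument, `* 0.5` form discussed above (assuming 0-based indexing).

```chapel
config const n = 1000;       // assumed even for the unrolled-by-2 version
const D = {0..<n};

// Element function with 'real' arguments and '* 0.5' in place of '/ 2'
// (per PR #19349); counterintuitively, the 'real' signature helped a lot:
inline proc A(i: real, j: real): real {
  return 1.0 / ((i + j) * (i + j + 1.0) * 0.5 + i + 1.0);
}

// The formulation we'd like to be fast on its own: a clean reduction per
// row.  (A serial 'foreach'-based accumulation is the other candidate, but
// see #19336 and the note above about needing to unroll it anyway.)
proc rowDot(i: real, v: [] real): real {
  return + reduce [j in v.domain] (A(i, j) * v[j]);
}

// The manually unrolled-by-2 variant that currently runs ~2x faster for us,
// presumably because the two independent accumulators vectorize better:
proc rowDotUnrolled2(i: real, v: [] real): real {
  var sum0 = 0.0, sum1 = 0.0;
  for j in v.domain by 2 {
    sum0 += A(i, j)     * v[j];
    sum1 += A(i, j + 1) * v[j + 1];
  }
  return sum0 + sum1;
}

proc main() {
  var v: [D] real;
  v = 1.0;
  writeln(rowDot(0.0, v), " ", rowDotUnrolled2(0.0, v));  // should agree
}
```

The only point of the second version is that it hands the backend two independent accumulators to vectorize; nothing about the factor of 2 is meant to be portable across processors, which is exactly the concern in the last sub-bullet.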
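For the second bullet, here's a minimal sketch of the pattern in question, with trivial stand-ins for the real `multiplyAv()`/`multiplyAtv()` bodies; the commented-out `static` form at the end is hypothetical syntax in the spirit of #12281, not something that compiles today.

```chapel
config const n = 8;
const D = {0..<n};

// Today: a scratch vector has to be declared by the caller and threaded
// through, purely so it can be reused across iterations rather than being
// allocated and freed on every call.  (The bodies here are trivial
// stand-ins for the real multiplyAv()/multiplyAtv() routines.)
proc multiplyAtAv(v: [] real, ref tmp: [] real, ref AtAv: [] real) {
  tmp = 2.0 * v;     // stand-in for: multiplyAv(v, tmp)
  AtAv = 0.5 * tmp;  // stand-in for: multiplyAtv(tmp, AtAv)
}

proc main() {
  var u, v, tmp: [D] real;
  u = 1.0;
  for 1..10 {
    multiplyAtAv(u, tmp, v);
    multiplyAtAv(v, tmp, u);
  }
  writeln(u);
}

// With function-static variables along the lines proposed in #12281
// (hypothetical syntax -- this does not compile today), 'tmp' could be
// private to multiplyAtAv() and still be allocated only once:
//
//   proc multiplyAtAv(v: [] real, ref AtAv: [] real) {
//     static var tmp: [v.domain] real;
//     multiplyAv(v, tmp);
//     multiplyAtv(tmp, AtAv);
//   }
```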
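And for the partial-reduction bullet, a sketch of the per-row reduction that `multiplyAv()` amounts to today; the `reduce(dim=...)` form in the comment is purely hypothetical syntax to suggest what a partial reduction might let us write.

```chapel
config const n = 4;
const D = {0..<n};

inline proc A(i: real, j: real): real {
  return 1.0 / ((i + j) * (i + j + 1.0) * 0.5 + i + 1.0);
}

// Today, the matrix-vector products amount to one full reduction per row:
proc multiplyAv(v: [] real, ref av: [] real) {
  forall i in D with (ref av) do
    av[i] = + reduce [j in D] (A(i, j) * v[j]);
}

// With partial reductions, this per-row loop could potentially collapse
// into a single whole-array expression that reduces away the 'j'
// dimension, something like the following (purely hypothetical syntax):
//
//   av = + reduce(dim=2) [(i, j) in {0..<n, 0..<n}] (A(i, j) * v[j]);
//
// ...though, as noted above, that would make the manual unroll-by-2 from
// the first bullet harder to express, which is why a better answer there
// matters.

proc main() {
  var v, av: [D] real;
  v = 1.0;
  multiplyAv(v, av);
  writeln(av);
}
```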