19350, "bradcray", "spectral-norm shootout, next steps", "2022-03-04T00:01:13Z"
While looking at spectral norm recently and getting ramped back up on it, I wanted to note possible next steps, starting from something resembling the version in [Spectral-norm: Change / 2 into * 0.5 to see the impact on nightly testing (PR #19349)](https://github.com/chapel-lang/chapel/pull/19349) (which we expect to submit to the CLBG site soon):
- Unrolling the main loops by 2 and changing the arguments of `A()` from `int` to `real` resulted in nearly a 2x speedup on my Mac and chapcs07, presumably due to vectorization. That said, I have not done any investigation of the generated code to understand what is being done or why. That would be good to understand better (a rough sketch of the formulations in question appears after this list). Specifically:
  - is there a way we could get similar performance from a `foreach` without unrolling by 2?
  - we can't use a foreach directly in the reduction (but would like to, I think), due to #19336
  - I've tried using a manual foreach and accumulation into a scalar, but didn't get good performance unless I unrolled it by 2 again
  - as a programmer, it would never occur to me to change the arguments of `A()` from `int` to `real` to improve performance because I think of reals as being strictly more expensive. And I think the compiler could never do such a transformation since `int`->`real` can lose information. Yet is there something we could do to help users get the benefits that this change provided?
  - unrolling by 2 also seems likely to be specific to a particular processor / vector architecture, which is unfortunate; how could we express this more cleanly and without presuming a specific implementation?
- The `tmp` vector declared in `main()` really doesn't have any purpose outside of `multiplyAtAv()` other than to avoid allocations and de-allocations. It would be nice to make it static to `multiplyAtAv()` for this reason, as proposed in #12281 (see the second sketch below).
- If/when we have partial reductions, using them for the `multiplyA[t]v()` routines would be natural and would clean up the code nicely (third sketch below). Of course, doing so would also make it more challenging to manually unroll by 2, which suggests it would want a more satisfying approach as indicated in bullet 1 above.
- I don't know that anybody has done a careful study of our fastest version vs. the fastest ones on the site currently, so that's always of interest as well: Is there a place where we're spending too much time?
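To make the first bullet concrete, here's a rough sketch of the two inner-loop formulations being compared. This is not the benchmark source: `rowDot` and `rowDotUnrolled2` are made-up names, and the `A()` shown is the `real`-argument, `* 0.5` form discussed above (assuming 0-based indexing).

```chapel
config const n = 1000;       // assumed even for the unrolled-by-2 version
const D = {0..<n};

// Element function with 'real' arguments and '* 0.5' in place of '/ 2'
// (per PR #19349); counterintuitively, the 'real' signature helped a lot:
inline proc A(i: real, j: real): real {
  return 1.0 / ((i + j) * (i + j + 1.0) * 0.5 + i + 1.0);
}

// The formulation we'd like to be fast on its own: a clean reduction per
// row.  (A serial 'foreach'-based accumulation is the other candidate, but
// see #19336 and the note above about needing to unroll it anyway.)
proc rowDot(i: real, v: [] real): real {
  return + reduce [j in v.domain] (A(i, j) * v[j]);
}

// The manually unrolled-by-2 variant that currently runs ~2x faster for us,
// presumably because the two independent accumulators vectorize better:
proc rowDotUnrolled2(i: real, v: [] real): real {
  var sum0 = 0.0, sum1 = 0.0;
  for j in v.domain by 2 {
    sum0 += A(i, j)     * v[j];
    sum1 += A(i, j + 1) * v[j + 1];
  }
  return sum0 + sum1;
}

proc main() {
  var v: [D] real;
  v = 1.0;
  writeln(rowDot(0.0, v), " ", rowDotUnrolled2(0.0, v));  // should agree
}
```

The only point of the second version is that it hands the backend two independent accumulators to vectorize; nothing about the factor of 2 is meant to be portable across processors, which is exactly the concern in the last sub-bullet.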
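For the second bullet, here's a minimal sketch of the pattern in question, with trivial stand-ins for the real `multiplyAv()`/`multiplyAtv()` bodies; the commented-out `static` form at the end is hypothetical syntax in the spirit of #12281, not something that compiles today.

```chapel
config const n = 8;
const D = {0..<n};

// Today: a scratch vector has to be declared by the caller and threaded
// through, purely so it can be reused across iterations rather than being
// allocated and freed on every call.  (The bodies here are trivial
// stand-ins for the real multiplyAv()/multiplyAtv() routines.)
proc multiplyAtAv(v: [] real, ref tmp: [] real, ref AtAv: [] real) {
  tmp = 2.0 * v;     // stand-in for: multiplyAv(v, tmp)
  AtAv = 0.5 * tmp;  // stand-in for: multiplyAtv(tmp, AtAv)
}

proc main() {
  var u, v, tmp: [D] real;
  u = 1.0;
  for 1..10 {
    multiplyAtAv(u, tmp, v);
    multiplyAtAv(v, tmp, u);
  }
  writeln(u);
}

// With function-static variables along the lines proposed in #12281
// (hypothetical syntax -- this does not compile today), 'tmp' could be
// private to multiplyAtAv() and still be allocated only once:
//
//   proc multiplyAtAv(v: [] real, ref AtAv: [] real) {
//     static var tmp: [v.domain] real;
//     multiplyAv(v, tmp);
//     multiplyAtv(tmp, AtAv);
//   }
```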
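And for the partial-reduction bullet, a sketch of the per-row reduction that `multiplyAv()` amounts to today; the `reduce(dim=...)` form in the comment is purely hypothetical syntax to suggest what a partial reduction might let us write.

```chapel
config const n = 4;
const D = {0..<n};

inline proc A(i: real, j: real): real {
  return 1.0 / ((i + j) * (i + j + 1.0) * 0.5 + i + 1.0);
}

// Today, the matrix-vector products amount to one full reduction per row:
proc multiplyAv(v: [] real, ref av: [] real) {
  forall i in D with (ref av) do
    av[i] = + reduce [j in D] (A(i, j) * v[j]);
}

// With partial reductions, this per-row loop could potentially collapse
// into a single whole-array expression that reduces away the 'j'
// dimension, something like the following (purely hypothetical syntax):
//
//   av = + reduce(dim=2) [(i, j) in {0..<n, 0..<n}] (A(i, j) * v[j]);
//
// ...though, as noted above, that would make the manual unroll-by-2 from
// the first bullet harder to express, which is why a better answer there
// matters.

proc main() {
  var v, av: [D] real;
  v = 1.0;
  multiplyAv(v, av);
  writeln(av);
}
```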