New Issue: spectral-norm shootout, next steps

19350, "bradcray", "spectral-norm shootout, next steps", "2022-03-04T00:01:13Z"

While looking at spectral norm recently and getting ramped back up on it, I wanted to note possible next steps, starting from something resembling the version in chapel-lang/chapel PR #19349, "Spectral-norm: Change / 2 into * 0.5 to see the impact on nightly testing" (which we expect to submit to the CLBG site soon):

  • Unrolling the main loops by 2 and changing the arguments of `A()` from `int` to `real` resulted in nearly a 2x speedup on my Mac and chapcs07, presumably due to vectorization. That said, I have not done any investigation of the generated code to understand what is being done or why. That would be good to understand better. Specifically:

    • is there a way we could get similar performance from a foreach / without unrolling by 2?
      • we can't use a foreach directly in the reduction (but would like to, I think), due to #19336
      • I've tried using a manual foreach and accumulation into a scalar, but didn't get good performance unless I unrolled it by 2 again
    • as a programmer, it would never occur to me to change the arguments of A from int to real to improve performance because I think of reals as being strictly more expensive. And I think the compiler could never do such a transformation since int->real can lose information. Yet is there something we could do to help users get the benefits that this resulted in?
    • unrolling by 2 also seems likely to be specific to a particular processor / vector architecture, which is unfortunate; how could we express this more cleanly and without presuming a specific implementation?
  • The tmp vector declared in main() really doesn't have any purpose outside of multiplyAtAv() other than to avoid allocations and de-allocations. For this reason, it would be nice to make it static to multiplyAtAv(), as proposed in #12281.

  • If/when we have partial reductions, using them for the multiplyA[t]v() routines would be natural and clean up the code nicely. Of course, doing so would also make it more challenging to manually unroll by 2, which suggests it would want a more satisfying approach as indicated in bullet 1 above.

  • I don't know that anybody has done a careful study of our fastest version vs. the fastest ones on the site currently, so that's always of interest as well: Is there a place where we're spending too much time?
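
For readers who haven't looked at the code, here is a minimal sketch of the kind of transformation described in the first bullet. It is not the code from the PR: `Areal()`, `multiplyAvUnrolled()`, and the driver in `main()` are hypothetical names made up for illustration, and the unrolled loop assumes that `n` is even. It only shows the shape of the two variants and checks that they agree numerically; it says nothing about why one vectorizes better than the other.

```chapel
// Illustrative sketch only; not the version from the PR.
config const n = 100;

// Straightforward form: A() takes integer arguments.
inline proc A(i: int, j: int): real {
  return 1.0 / ((((i+j) * (i+j+1)) / 2) + i + 1);
}

// Hypothetical variant: same formula with real arguments, '* 0.5' for '/ 2'.
inline proc Areal(i: real, j: real): real {
  return 1.0 / ((i+j) * (i+j+1) * 0.5 + i + 1);
}

// Reduction over a serial loop expression, one element per iteration.
proc multiplyAv(v: [?Dv] real, ref Av: [?DAv] real) {
  forall i in DAv do
    Av[i] = + reduce (for j in Dv do A(i, j) * v[j]);
}

// Manually unrolled by 2, accumulating into a scalar.
// Assumes Dv has an even number of elements.
proc multiplyAvUnrolled(v: [?Dv] real, ref Av: [?DAv] real) {
  forall i in DAv {
    const ir = i: real;
    var sum = 0.0;
    for j in Dv by 2 do
      sum += Areal(ir, j: real) * v[j] + Areal(ir, (j+1): real) * v[j+1];
    Av[i] = sum;
  }
}

proc main() {
  if n % 2 != 0 then halt("this sketch assumes n is even");

  var v, a1, a2: [0..#n] real;
  v = 1.0;

  multiplyAv(v, a1);
  multiplyAvUnrolled(v, a2);

  // The two variants should agree up to floating-point rounding.
  writeln("max difference: ", max reduce abs(a1 - a2));
}
```

The hard-coded `by 2` and the even-`n` assumption are exactly the architecture-specific details that the sub-bullets above would like to avoid baking into the source.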