New Issue: Should we always unroll (inner) 'foreach' loops by 2–4?

24864, "bradcray", "Should we always unroll (inner) 'foreach' loops by 2–4?", "2024-04-15T23:06:04Z"

This weekend I was thinking about the performance difference between nbody3 and the current nbody-blc, which uses foreach loops rather than param-unrolled loops, and about how to minimize that difference. That left me wondering whether the Chapel compiler could or should always unroll foreach loops by 2–4 in order to improve the back-end vectorizer's chances of vectorizing them well.

The obvious downside would be bloat in the generated code, particularly for loops that weren't vectorizable to begin with. That might suggest doing it only for innermost loops, though even that may be too much bloat. For that reason, it may be better to invest more effort in teaching the LLVM back-end enough about Chapel's foreach loops that it can make such decisions itself. In the meantime, I'm curious where this approach would get us, given that it seems fairly simple.
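
For concreteness, here is a rough sketch of what "unroll by 4" means for a simple order-independent loop. It is written in C rather than Chapel, and it is not what the Chapel compiler actually emits. The function names (`scale_rolled`, `scale_unrolled4`) and the loop body are hypothetical, and `restrict` is only an assumed stand-in for the independence of iterations that a foreach loop asserts.

```c
#include <stddef.h>

/* Rolled form: one element per iteration. Hypothetical stand-in for the
 * body of an order-independent loop such as a Chapel 'foreach'. */
void scale_rolled(double *restrict dst, const double *restrict src,
                  double k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* Same loop unrolled by 4. The main loop handles groups of four elements;
 * the remainder loop handles the last n % 4 elements. Note that the body
 * now appears five times, which is the code-size cost of unrolling. */
void scale_unrolled4(double *restrict dst, const double *restrict src,
                     double k, size_t n) {
    size_t i = 0;
    size_t main_end = n - (n % 4);  /* largest multiple of 4 that is <= n */

    for (; i < main_end; i += 4) {
        dst[i]     = k * src[i];
        dst[i + 1] = k * src[i + 1];
        dst[i + 2] = k * src[i + 2];
        dst[i + 3] = k * src[i + 3];
    }

    /* Remainder: zero to three leftover elements. */
    for (; i < n; i++)
        dst[i] = k * src[i];
}
```

Both functions compute the same result. The unrolled version simply hands the back-end four independent statements per iteration, at the cost of a larger body plus a remainder loop, which is the bloat trade-off described above.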