New Issue: Should we always unroll (inner) 'foreach' loops by 2–4?

bradcray-github · April 15, 2024, 11:07pm

24864, "bradcray", "Should we always unroll (inner) 'foreach' loops by 2–4?", "2024-04-15T23:06:04Z"

opened 11:06PM - 15 Apr 24 UTC

area: Compiler, type: Performance

This weekend, while thinking about the performance difference between [nbody3](https://github.com/chapel-lang/chapel/blob/main/test/studies/shootout/submitted/nbody3.chpl) and the current [nbody-blc](https://github.com/chapel-lang/chapel/blob/main/test/studies/shootout/nbody/bradc/nbody-blc.chpl), which uses `foreach` loops rather than `param`-unrolled loops, and how to minimize the differences between the two, I found myself wondering whether the Chapel compiler could/should always just unroll `foreach` loops by 2–4 in order to improve the back-end vectorizer's chances of doing good stuff with them.

The obvious downside would be generated code bloat, particularly for loops that weren't particularly vectorizable to begin with, which might suggest doing it just for innermost loops (but maybe even that would be too much bloat)? And for that reason, it may be better to just invest more effort in teaching the LLVM back-end enough about Chapel's `foreach` loops that it could make such decisions itself. But in the meantime, I ended up being curious where this approach would get us, given that it seems fairly simple.
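
For readers unfamiliar with the transformation being proposed, here is a rough hand-written sketch of what unrolling a `foreach` loop by 4 might look like at the source level. It is not taken from the issue or from the nbody codes, and it is not how the compiler implements (or would implement) anything; the array names, the problem size `n`, and the loop body are all hypothetical and chosen only for illustration.

```chapel
config const n = 10;   // hypothetical problem size, not from the issue
param factor = 4;      // unroll factor, inside the 2-4 range under discussion

var B: [0..<n] real = 1.0;
var A1, A2: [0..<n] real;

// Original form: an order-independent loop the back end may vectorize.
foreach i in 0..<n do
  A1[i] = 2.0 * B[i];

// Hand-unrolled form: step by 'factor' and replicate the body with a
// 'param' loop, which is expanded at compile time into 'factor' copies.
const limit = n - n % factor;
foreach i in 0..<limit by factor do
  for param j in 0..<factor do
    A2[i + j] = 2.0 * B[i + j];

// Remainder loop for the iterations left over when n is not a
// multiple of 'factor'.
for i in limit..<n do
  A2[i] = 2.0 * B[i];

// Check that both forms computed the same values.
writeln(&& reduce (A1 == A2));
```

The replicated body and the extra remainder loop are where the code bloat mentioned above would come from: each unrolled loop carries roughly `factor` copies of its body plus a cleanup loop.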
