[Chapel Merge] Further reduce communication for BlockDist scans

Branch: refs/heads/main
Revision: 64195fc
Author: ronawho
Link: Further reduce communication for BlockDist scans by ronawho · Pull Request #20100 · chapel-lang/chapel · GitHub
Log Message:

Merge pull request #20100 from ronawho/reduce-block-dist-scan-comm

Further reduce communication for BlockDist scans

[reviewed by @benharsh]

In #19968, we improved scan scalability by significantly reducing the
amount of communication, leaving only 16 GETs and 3 PUTs on non-0
locales. This PR improves on that and gets us down to 6 GETs and 1
PUT. The remaining GETs/PUTs all come from the result array creation
that's part of the scan; the scan itself now does no RDMA, just 2
coforall+ons from locale 0 and a remote-on back to locale 0 from each
non-0 locale.
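
For readers unfamiliar with the shape of a block-distributed scan, the
two coforall+ons and the per-locale report back to locale 0 can be
pictured with the rough model below. This is only a sketch under
assumptions, not the Chapel implementation: it assumes the usual
two-pass structure (local scan, then offset fix-up), simulates locales
as plain Python lists, and the name block_scan is hypothetical.

```python
from itertools import accumulate

def block_scan(blocks):
    """Inclusive + scan over per-locale blocks (hypothetical model)."""
    # Pass 1 (stands in for the first coforall+on): each locale scans
    # its own block using only local data.
    local = [list(accumulate(b)) for b in blocks]
    # Each locale reports a single value, its block total, to locale 0
    # (stands in for the remote-on back to locale 0).
    totals = [s[-1] if s else 0 for s in local]
    # Locale 0 turns the totals into an exclusive offset per locale.
    offsets = [0] + list(accumulate(totals))[:-1]
    # Pass 2 (stands in for the second coforall+on): each locale adds
    # its offset to its local results.
    return [[x + off for x in s] for s, off in zip(local, offsets)]

result = block_scan([[1, 2], [3, 4], [5, 6]])
```

In this model the only values that cross locale boundaries are one
total and one offset per locale, which is the property that keeps the
scan's communication independent of the array size.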

This results in a minor performance improvement, taking a trivially
sized scan on 64 nodes of an InfiniBand system from 0.004s to 0.003s.
The change was mostly motivated by wanting to understand where the
comm was coming from, but the minor performance improvement is nice.

This is split into 4 commits, summarized below:

Reduce comm for storing prescan results