[Chapel Merge] Improve BlockDist scan scalability

Branch: refs/heads/main
Revision: 46ffb75
Author: ronawho
Link: Improve BlockDist scan scalability by ronawho · Pull Request #19968 · chapel-lang/chapel · GitHub
Log Message:

Merge pull request #19968 from ronawho/improve-block-dist-scan-scalability

Improve BlockDist scan scalability

[reviewed by @bradcray]

Improve the scalability of scans on block distributed arrays by removing
replicated array creation and reducing comm in a serial section.

At a high level, block uses a multi-pass scan. First, each task does a
local scan on its region of the array. Then a serial inter-locale scan
of the per-locale results is done. Finally, each task updates its region
with the per-locale results.
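
The three passes above can be sketched in a single process. This is only
an illustration in Python, not the BlockDist code: the function name
block_scan and the way blocks are cut are assumptions, and the passes
that run in parallel across locales are written here as plain loops.

```python
from itertools import accumulate

def block_scan(data, num_locales):
    """Inclusive + scan over data, split into contiguous blocks.

    Hypothetical helper: each block stands in for one locale's region.
    """
    n = len(data)
    # Assumed blocking: near-equal contiguous chunks, one per "locale".
    bounds = [(n * i) // num_locales for i in range(num_locales + 1)]
    blocks = [data[bounds[i]:bounds[i + 1]] for i in range(num_locales)]

    # Pass 1 (parallel in the real code): local scan of each block.
    local = [list(accumulate(b)) for b in blocks]
    totals = [s[-1] if s else 0 for s in local]

    # Pass 2 (serial): exclusive scan of the per-block totals.
    offsets = [0] + list(accumulate(totals))[:-1]

    # Pass 3 (parallel in the real code): shift each block by its offset.
    return [x + off for s, off in zip(local, offsets) for x in s]
```

Only pass 2 is serial, and it touches one value per locale, which is why
the cost of reading those values dominates at scale.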

Previously, we used replicated arrays to store per-locale results and
coordinate when results were ready. This led to performance issues since
replicated array creation scales poorly (Cray/chapel-private#1756) and
having initial results and ready flags distributed meant the serial scan
did blocking remote comm to check the ready flags and read results.

This switches to local arrays for the initial results and ready flags.
Remote locales update these in parallel during their local scan, and the
serial inter-locale scan only has to operate on local data. For the
output ready flags, a manual replicated-like array is used. On the
initiating node there's a local array of classes that point to a remote
sync var. These sync vars are allocated by the remote nodes during the
initial scan and the address is stored back on the initiating node. This
allows the remote nodes to wait on something local while still allowing
the initiating node to know the addresses to wake them up. Even if
replicated creation were optimized, this scheme would still be faster
because only the initiating node needs to know about the remote
allocations and the remote allocations are done as part of existing
computations.
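
As a rough single-process analogy for this coordination scheme, the
sketch below uses Python threads in place of locales and a
threading.Event in place of a sync var. The names Mailbox and
scan_with_local_flags are hypothetical and do not appear in
BlockDist.chpl. Each worker allocates its own mailbox, records a
reference to it in an array owned by the initiator, and then waits on
that mailbox; the initiator's serial loop reads only its own arrays.

```python
import threading
from itertools import accumulate

class Mailbox:
    """Hypothetical stand-in for a sync var allocated on a remote locale."""
    def __init__(self):
        self.ready = threading.Event()
        self.offset = None

def scan_with_local_flags(blocks):
    n = len(blocks)
    # These three arrays are all "local to the initiating node".
    totals = [0] * n
    in_ready = [threading.Event() for _ in range(n)]
    mailboxes = [None] * n   # references to the "remote" mailboxes
    results = [None] * n

    def worker(i):
        local = list(accumulate(blocks[i]))   # pass 1: local scan
        box = Mailbox()                       # allocated by the worker
        mailboxes[i] = box                    # address stored on initiator
        totals[i] = local[-1] if local else 0
        in_ready[i].set()                     # input is ready
        box.ready.wait()                      # wait on something local
        results[i] = [x + box.offset for x in local]   # pass 3

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()

    # Pass 2: serial scan on the initiator, reading only its own arrays.
    running = 0
    for i in range(n):
        in_ready[i].wait()
        mailboxes[i].offset = running
        mailboxes[i].ready.set()              # wake worker i
        running += totals[i]

    for t in threads:
        t.join()
    return [x for r in results for x in r]
```

The worker stores its mailbox before setting the input flag, so the
initiator never reads an empty slot.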

Overall, these changes significantly decrease the amount of comm and
improve scalability. At 512 nodes on an XC we see a trivially sized scan
go from ~0.5s to ~0.005s, where ~0.3s comes from removing the replicated
array creation and ~0.2s comes from improving the speed of the serial
region.

Here are some timings and comm counts:

chpl test/scan/scanPerf.chpl --fast --no-cache-remote
./scanPerf --printTiming --printArray=false --n=1024 -nl {16,512}

Execution time:

Config | 16 nodes | 512 nodes
Before |  0.0091s |   0.5022s
Now    |  0.0005s |   0.0041s
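
For reference, the speedups implied by these timings can be computed
directly. The snippet below is plain arithmetic on the numbers in the
execution time table, and the name speedup is only illustrative. It
works out to roughly 18x at 16 nodes and roughly 122x at 512 nodes.

```python
# Timings in seconds, copied from the execution time table:
# node count -> (before, now).
timings = {16: (0.0091, 0.0005), 512: (0.5022, 0.0041)}

def speedup(before, now):
    """Ratio of the old execution time to the new one."""
    return before / now

for nodes, (before, now) in timings.items():
    print(f"{nodes} nodes: {speedup(before, now):.1f}x faster")
```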

Comm (non-0 locales GETs):

Config | 16 nodes | 512 nodes
Before |     1514 |     33678
Now    |       16 |        16

Comm (locale 0 PUTs):

Config | 16 nodes | 512 nodes
Before |     2007 |     74175
Now    |        0 |         0

And just a summary of the total comm at 512 nodes now:

locale | get | put | execute_on
     0 |   0 |   0 |          2
 non-0 |  16 |   3 |          3

The important things to note are that there is far less comm than
before, and the amount of comm per node is constant regardless of scale.

Resolves Cray/chapel-private#1791
Motivated by Bears-R-Us/arkouda#1404

Modified Files:
M modules/dists/BlockDist.chpl
M test/modules/bradc/printModStuff/foo.good

Compare: https://github.com/chapel-lang/chapel/compare/965e66c99299...46ffb75519e4