[Chapel Merge] Improve NUMA affinity and startup times for config

Branch: refs/heads/master
Revision: c5e9e86
Author: ronawho
Log Message:

Merge pull request #17405 from ronawho/par-interleave-heap-init

Improve NUMA affinity and startup times for configs that use a fixed heap

[reviewed by @gbtitus]

Improve the startup time and NUMA affinity for configurations that use a
fixed heap by interleaving and parallelizing the heap fault-in.
High-performance networks require that memory be registered with the
NIC/HCA in order to do RDMA. We can either register all communicable memory at
startup using a fixed heap or we can register memory dynamically at some
point after it's been allocated in the user program.

Static registration can offer better communication performance since
there's just one registration call at startup and no lookups or
registration at communication time. However, static registration causes
slow startup because all memory is faulted in at program startup. Prior
to this effort, that fault-in was done serially as a side effect of
registering memory with the NIC. Serial fault-in also resulted in poor
NUMA affinity and ignored user first-touch. Effectively, this meant that
most operations were just using memory out of NUMA domain 0, which
created a bandwidth bottleneck. Because of slow startup and poor
affinity, we historically preferred dynamic registration when available
(for gasnet-ibv we default to segment large instead of fast, and for
ugni we prefer dynamic registration).

This PR improves the situation for static registration by touching the
heap in parallel prior to registration, which improves fault-in speed.
We also interleave the memory faults so that pages are spread
round-robin or cyclically across the NUMA domains. This results in
better NUMA behavior since we're not just using NUMA domain 0. Half our
memory references will still go to the wrong domain, so NUMA affinity
isn't really "better"; we're just spreading load between the memory
controllers.

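As a rough illustration of the idea (this is not the code from the PR),
here is a minimal C sketch of a parallel, cyclic heap touch. The names
page_domain and fault_in_heap are hypothetical. The sketch assumes a
first-touch policy and that a real runtime would pin each toucher
thread to a core in the domain it is meant to populate; pinning is
omitted here to keep the example portable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: the NUMA domain that page `page` would land on
 * under a round-robin (cyclic) interleave across `num_domains`. */
static size_t page_domain(size_t page, size_t num_domains) {
  return page % num_domains;
}

typedef struct {
  unsigned char *heap;
  size_t num_pages;
  size_t page_size;
  size_t tid;          /* index of this toucher thread */
  size_t num_threads;  /* total toucher threads */
} touch_arg_t;

/* Thread `tid` touches pages tid, tid + T, tid + 2T, ... so that
 * consecutive pages are faulted in by different threads. */
static void *touch_pages(void *p) {
  touch_arg_t *a = p;
  for (size_t pg = a->tid; pg < a->num_pages; pg += a->num_threads) {
    volatile unsigned char *b = a->heap + pg * a->page_size;
    *b = 0;  /* the write is what faults the page in */
  }
  return NULL;
}

/* Fault in a fresh heap in parallel before it is registered.
 * Assumption: the heap holds no live data yet, since the first byte
 * of every page is overwritten. Returns 0 on success, -1 on failure. */
static int fault_in_heap(void *heap, size_t size, size_t num_threads) {
  assert(heap != NULL);
  long ps = sysconf(_SC_PAGESIZE);
  if (ps <= 0 || num_threads == 0) return -1;

  size_t page_size = (size_t)ps;
  size_t num_pages = (size + page_size - 1) / page_size;
  pthread_t *threads = malloc(num_threads * sizeof *threads);
  touch_arg_t *args = malloc(num_threads * sizeof *args);
  if (threads == NULL || args == NULL) {
    free(threads);
    free(args);
    return -1;
  }

  size_t started = 0;
  for (size_t t = 0; t < num_threads; t++) {
    args[t] = (touch_arg_t){ (unsigned char *)heap, num_pages,
                             page_size, t, num_threads };
    if (pthread_create(&threads[t], NULL, touch_pages, &args[t]) != 0)
      break;
    started++;
  }
  for (size_t t = 0; t < started; t++)
    pthread_join(threads[t], NULL);

  free(threads);
  free(args);
  return started == num_threads ? 0 : -1;
}
```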
Here are some performance results for stream on a couple of different
platforms. Stream has no communication and is NUMA affinity sensitive.
The tables below show the reported benchmark rate and the total
execution time to show startup costs. Results for dynamic registration
are shown as a best-case comparison. Results have been rounded to make
them easier to parse (nearest 5 GB/s and 1 second). Generally speaking,
we see better, but not perfect, performance and significant improvements
in startup time.

export CHPL_LAUNCHER_REAL_WRAPPER=$CHPL_HOME/util/test/timers/highPrecisionTimer
chpl examples/benchmarks/hpcc/stream.chpl --fast
./stream -nl 8 --m=2861913600

Cray XC: