[Chapel Merge] Add an option to interleave large allocations for

Branch: refs/heads/main
Revision: 49802f8
Author: ronawho
Log Message:

Merge pull request #18350 from ronawho/interleave-lg-allocations-flag

Add an option to interleave large allocations for gasnet-ibv-large

[discussed with @bradcray and @gbtitus, full review post-merge]

Add a compiler option --interleave-memory that interleaves large
allocations under gasnet-ibv with segment large. #18299 reduced memory
fragmentation for configurations that use a fixed heap, but it hurt
NUMA affinity for configurations that use a fixed heap and still rely
on first-touch to set NUMA affinity (which I think is just
gasnet-ibv-large). As a stopgap to reduce the performance impact for
those configurations, this adds an option to interleave pages
(round-robin affinity between NUMA domains). This limits peak
performance for NUMA-sensitive applications, but also significantly
reduces the worst-case performance impact. For traditional HPC
applications that just allocate a few large arrays, the old first-touch
behavior is better, so make this feature opt-in instead of enabling it
by default.
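
To illustrate what round-robin page affinity means, here is a minimal
sketch in C. It only models the policy (which NUMA domain each page of
an allocation would land on). It is not the runtime code from this
change, and the function names, page size, and domain count are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: NUMA domain that would own the page containing
 * byte `offset` of an allocation when pages are assigned round-robin. */
int interleaved_domain(size_t offset, size_t page_size, int num_domains) {
    assert(page_size > 0 && num_domains > 0);
    size_t page_index = offset / page_size;
    return (int)(page_index % (size_t)num_domains);
}

/* Hypothetical helper: number of pages of an `alloc_size`-byte
 * allocation that land on `domain` under the same policy. */
size_t pages_on_domain(size_t alloc_size, size_t page_size,
                       int num_domains, int domain) {
    assert(page_size > 0 && num_domains > 0);
    assert(domain >= 0 && domain < num_domains);
    /* Round up so a partially used last page is still counted. */
    size_t num_pages = (alloc_size + page_size - 1) / page_size;
    size_t full_rounds = num_pages / (size_t)num_domains;
    size_t leftover = num_pages % (size_t)num_domains;
    /* The first `leftover` domains each receive one extra page. */
    return full_rounds + ((size_t)domain < leftover ? 1 : 0);
}
```

With two domains, each one holds about half of every large array no
matter which task touches it first. That is why the worst case improves
while the best case (all pages local to the accessing task) is no
longer reachable.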

Longer term, I think we want to separately allocate large arrays and
completely return them to the system so we can get fresh NUMA affinity
(this is what we do for ugni). In the short term, this provides a way
to limit the performance impact of the memory reuse that resulted from
reducing fragmentation, for applications like Arkouda that do lots of
varying-sized dynamic allocations.

Currently, this is a developer compiler flag. It's a developer option
since we view this as a short-term workaround until we can do separate
allocations or something else, and it's a compiler flag so that it's
easier for applications to detect whether the current Chapel
compiler/build supports this option by inspecting the output of
chpl --devel --help.
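
As a rough sketch of that detection, a build script or launcher could
scan the help output for the option name. The C below is illustrative
only: the function names are hypothetical, and it matches on
"interleave-memory" without the leading dashes on the assumption that
the help text may print the flag with a negation prefix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if `help_text` mentions `flag_name`. */
bool help_mentions_flag(const char *help_text, const char *flag_name) {
    assert(help_text != NULL && flag_name != NULL);
    return strstr(help_text, flag_name) != NULL;
}

/* Hypothetical helper: runs `chpl --devel --help` and scans its output
 * line by line. Returns false if chpl cannot be run or the option name
 * never appears. */
bool chpl_supports_interleave_memory(void) {
    FILE *pipe = popen("chpl --devel --help 2>/dev/null", "r");
    if (pipe == NULL) {
        return false;
    }
    char line[1024];
    bool found = false;
    /* Keep reading to the end so the child process can exit cleanly. */
    while (fgets(line, sizeof line, pipe) != NULL) {
        if (help_mentions_flag(line, "interleave-memory")) {
            found = true;
        }
    }
    pclose(pipe);
    return found;
}
```

A caller would pass --interleave-memory to chpl only when
chpl_supports_interleave_memory() returns true, and otherwise fall back
to the default first-touch behavior.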

Note that we're only interleaving large allocations. Small allocations
(task stacks, aggregation buffers) still have first-touch semantics,
which should result in better NUMA affinity for these types of data
structures.

Performance results for Arkouda numeric operations (GiB/s):

config             argsort  2-coarg  2-groupby  gather  scatter   reduce
original             10.86     8.93       5.67  124.93   150.27  3377.73
fragmentation fix    10.45     8.35       4.54  113.62   132.85  1978.72
interleave alloc     10.02     8.83       5.56  124.54   128.07  2061.52

Part of #18286

Modified Files:
M compiler/codegen/codegen.cpp
M compiler/include/driver.h
M compiler/main/driver.cpp
M runtime/include/chpl-topo.h
M runtime/include/chplcgfns.h
M runtime/src/mem/jemalloc/mem-jemalloc.c
M runtime/src/topo/hwloc/topo-hwloc.c
M runtime/src/topo/none/topo-none.c
M util/chpl-completion.bash

Compare: https://github.com/chapel-lang/chapel/compare/98251ec222b9...49802f888994