Hello,
I have run into scalability issues when running on a Cray EX machine (the UK's ARCHER2).
My installation process:
$ export CHPL_LAUNCHER=slurm-srun
$ cd chapel-1.24.1
$ export CHPL_HOME=$(pwd)
$ make
$ source util/setchplenv.bash
$ util/printchplenv
machine info: Linux uan01 4.12.14-197.56_9.1.44-cray_shasta_c #1 SMP Fri Oct 9 22:00:11 UTC 2020 (6d7e380) x86_64
CHPL_TARGET_PLATFORM: hpe-cray-ex
CHPL_TARGET_COMPILER: cray-prgenv-cray
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: x86-rome
CHPL_LOCALE_MODEL: flat
CHPL_COMM: ofi
CHPL_LIBFABRIC: system
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-srun *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_NETWORK_ATOMICS: ofi
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_REGEXP: re2
CHPL_LLVM: none
CHPL_AUX_FILESYS: none
At this point, I compiled stream.chpl from the examples dir, which works as expected:
salloc --nodes=1 --tasks-per-node=1 --cpus-per-task=128 srun --distribution=block:block --hint=nomultithread --unbuffered --kill-on-bad-exit ./stream_real '-nl' '1'
Performance (GB/s) = 222.734
salloc --nodes=2 --tasks-per-node=1 --cpus-per-task=128 srun --distribution=block:block --hint=nomultithread --unbuffered --kill-on-bad-exit ./stream_real '-nl' '2'
Performance (GB/s) = 444.349
However, when I run a stencil benchmark such as this one (or much simpler ones), performance is as expected on a single node (1 locale), but as soon as I go to 2 or more nodes, it slows down by 3-4x. I suspect something is not set up properly for communication, but I am unsure what. Could you please advise?
Best,
Istvan