I have run into some scalability issues when running on a Cray EX machine (UK's Archer2).
My installation process:
machine info: Linux uan01 4.12.14-197.56_9.1.44-cray_shasta_c #1 SMP Fri Oct 9 22:00:11 UTC 2020 (6d7e380) x86_64
CHPL_LAUNCHER: slurm-srun *
At this point, I compiled stream.chpl from the examples directory, which works as expected:
salloc --nodes=1 --tasks-per-node=1 --cpus-per-task=128 srun --distribution=block:block --hint=nomultithread --unbuffered --kill-on-bad-exit ./stream_real '-nl' '1'
Performance (GB/s) = 222.734
salloc --nodes=2 --tasks-per-node=1 --cpus-per-task=128 srun --distribution=block:block --hint=nomultithread --unbuffered --kill-on-bad-exit ./stream_real '-nl' '2'
Performance (GB/s) = 444.349
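For reference, taking the two STREAM figures above at face value, going from 1 to 2 locales gives near-linear scaling:

$$
\text{speedup} = \frac{444.349}{222.734} \approx 1.995,
\qquad
\text{efficiency} = \frac{444.349}{2 \times 222.734} \approx 99.7\%
$$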
However, when I run a stencil benchmark such as this one (or much simpler ones), performance is as expected on a single node (1 locale), but as soon as I go to 2 or more nodes, it runs 3-4x slower. I suspect something is not set up properly for communication, but I am unsure what. Could you please advise?
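In case it helps narrow this down, one check I could run is to wrap a kernel in Chapel's CommDiagnostics counters and compare the per-locale communication counts between 1 and 2 locales. The following is only a minimal sketch, not the benchmark mentioned above: it assumes Chapel 2.x (where the distribution is spelled `blockDist`; older releases call it `Block`), and the three-point 1D stencil, the array names `A` and `B`, and the problem size `n` are all hypothetical.

```chapel
use BlockDist, CommDiagnostics;

// Hypothetical problem size; override with --n=<value> at run time.
config const n = 1_000_000;

// Block-distributed 1D domain with one boundary point on each side.
const D = blockDist.createDomain({0..n+1});
// Interior points only, so that i-1 and i+1 stay in bounds.
const DInner = D.expand(-1);

var A, B: [D] real;
A[0] = 1.0;  // arbitrary boundary value for illustration

// Count communication for the stencil sweep only.
resetCommDiagnostics();
startCommDiagnostics();

forall i in DInner do
  B[i] = (A[i-1] + A[i] + A[i+1]) / 3.0;

stopCommDiagnostics();

// One record per locale (gets, puts, remote executions, ...).
writeln(getCommDiagnostics());
```

With this access pattern, each locale should only need to fetch neighbouring values at the edges of its block, so a large number of small gets or puts per locale would suggest that the slowdown comes from fine-grained communication rather than from the computation itself.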