Setup for scaling on a Cray EX


I have run into some scalability issues when running on a Cray EX machine (UK's Archer2).
My installation process:

$ export CHPL_LAUNCHER=slurm-srun
$ cd chapel-1.24.1
$ export CHPL_HOME=`pwd`
$ source util/setchplenv.bash
machine info: Linux uan01 4.12.14-197.56_9.1.44-cray_shasta_c #1 SMP Fri Oct 9 22:00:11 UTC 2020 (6d7e380) x86_64
CHPL_TARGET_COMPILER: cray-prgenv-cray
CHPL_TASKS: qthreads
CHPL_LAUNCHER: slurm-srun *
CHPL_TIMERS: generic
CHPL_MEM: jemalloc
CHPL_GMP: bundled
CHPL_HWLOC: bundled

At this point, I compiled stream.chpl from the examples dir, which works as expected:

salloc --nodes=1 --tasks-per-node=1 --cpus-per-task=128 srun --distribution=block:block --hint=nomultithread --unbuffered --kill-on-bad-exit ./stream_real '-nl' '1'
Performance (GB/s) = 222.734
salloc --nodes=2 --tasks-per-node=1 --cpus-per-task=128 srun --distribution=block:block --hint=nomultithread --unbuffered --kill-on-bad-exit ./stream_real '-nl' '2'
Performance (GB/s) = 444.349

However, when I try to run a stencil benchmark such as this one (or much simpler ones), performance on a single node (1 locale) is as expected, but as soon as I go to 2 nodes or more, it slows down 3-4x. I suspect something is not set up properly for communication, but I am unsure what. Can you please advise?


Hi Istvan —

First of all, thanks very much for your interest in Chapel on HPE Cray EX. For what it's worth, you now have more experience with Chapel on EX than I do (though that's not saying a lot these days, given where my time tends to be spent...).

While our team has been working hard on getting Chapel working on Cray EX systems, it is unfortunately still early days for the performance optimization effort for that system. As you may be aware, we've recently been doing performance tuning work for IB-based networks which resulted in some significant improvements in the 1.24.1 release, but these were focused primarily on HPE Apollo systems using our gasnet/ibv communication option rather than Cray EX using ofi.

That said, we're definitely interested in seeing what we can do to help you along, and are looking into making sure we have access to an internal system similar to Archer2 in order to avoid having you be our eyes and hands in this tuning loop.

As a starting-point sanity check, I'd be curious what running your program with -v reports on the "COMM=ofi" line, which should indicate which provider your program is using.
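For instance, reusing the stream_real binary from earlier in the thread (the grep is just for convenience; the full -v output includes other runtime details too):

```shell
# -v makes the Chapel launcher/runtime verbose; among other things it
# prints a "COMM=ofi:" line naming the libfabric provider in use.
./stream_real -nl 2 -v 2>&1 | grep 'COMM=ofi'
```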

Next, our go-to folks on performance (generally) and EX (specifically) — @ronawho and @gbtitus — recommend that you try setting the environment variable CHPL_RT_MAX_HEAP_SIZE to something like 80% of your nodes' physical memory to see whether that helps. This will hurt NUMA affinity at 2+ nodes (so will cause Stream to suffer, for example), but could help with overall performance. We'll be curious to hear what you find until we can reproduce ourselves locally.
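As a concrete sketch, two equivalent ways to set this (the percentage form is supported directly by the runtime; the /proc/meminfo computation is just one way to derive an explicit byte count, shown here as an assumption about how you might compute it):

```shell
# Option 1: let the Chapel runtime compute 80% of physical memory.
export CHPL_RT_MAX_HEAP_SIZE=80%

# Option 2: an explicit byte count (MemTotal in /proc/meminfo is in kB).
heap=$(awk '/^MemTotal:/ { printf "%d", $2 * 1024 * 0.80 }' /proc/meminfo)
export CHPL_RT_MAX_HEAP_SIZE=$heap
```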

I'll also mention, anecdotally, that on some systems (though not our best-tuned ones), we've seen benchmarks like PRK Stencil take a hit when going from 1 to 2 nodes and then recover as more nodes are added, and I'd be curious whether you observe this or whether performance stays bad / gets worse.

Thanks again for your question, and for any follow-up information,

Hi Brad,

Thanks for the quick reply! Running with -v on 2 locales shows me the following:

QTHREADS: Using 128 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8384512 byte stack size.
QTHREADS: Using 128 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8384512 byte stack size.
COMM=ofi: using "sockets" provider
executing on node 0 of 2 node(s): nid001208
executing on node 1 of 2 node(s): nid001209
Parallel Research Kernels Version 2.17
Serial stencil execution on 2D grid
Grid size = 8192
Radius of stencil = 2
Type of stencil = compact
Data type = real(64)
Number of iterations = 100
Distribution = Stencil
Solution validates
Total time 3.7539
Rate (MFlops/s): 91084.195987 Avg time (s): 0.037539
stencil time = 0.0127753
increment time = 0.0114022
comm time = 0.0133608

When I set heap size:

Curiously, it changes the "COMM" field, and then I get the following runtime error:
COMM=ofi: using "verbs;ofi_rxm" provider
Fri Apr 30 08:55:31 2021: [PE_0]:_pmi2_add_kvs:ERROR: The KVS data segment of 12 entries is not large enough. Increase the number of KVS entries by setting env variable PMI_MAX_KVS_ENTRIES to a higher value.

So then I set this variable and re-ran, which completed successfully:

COMM=ofi: using "verbs;ofi_rxm" provider

But unfortunately things get even slower. Similarly, when I move to 4 nodes, things slow down further.

Stream running on 4 nodes also gives me the PMI_MAX_KVS_ENTRIES error, but after increasing that setting, bandwidth is as expected (887 GB/s). When I additionally set CHPL_RT_MAX_HEAP_SIZE=80%, it goes down to 605 GB/s.
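For reference, the PMI workaround above amounts to something like the following (512 is an arbitrary illustrative value I've picked for the sketch; the error message only says the default of 12 entries is too small):

```shell
# Enlarge Slurm/Cray PMI's key-value store before launching; the
# default KVS segment was too small for the verbs;ofi_rxm runs.
export PMI_MAX_KVS_ENTRIES=512
```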

Hi @reguly

Thanks for this additional information. I believe the switch in ofi providers once CHPL_RT_MAX_HEAP_SIZE is set is what @gbtitus was anticipating yesterday, but today was a day off for him, so I didn't get a chance to confer with him. I expect we'll hear more from him on Monday.

Thanks for sharing your other observations with us. I think some of them are expected given the current state of things, and that others are new results for us. I believe that @ronawho made some progress toward getting onto a similar system today, but timed out before being able to reproduce your results. Unless he or Greg has a suggestion for something that would be worthwhile for you to try next week, I expect that our next step here will be to reproduce the situation locally and see what we can recommend. Please keep after us if you don't hear from us early next week (though I expect you will).

Have a good weekend,

Hello Istvan --

That functional result looks reasonable to me, but we'll need to look into the performance problem. The change from "sockets" to "verbs;ofi_rxm" as the underlying communication provider is the expected result of setting a fixed heap size with CHPL_RT_MAX_HEAP_SIZE. And the PMI_MAX_KVS_ENTRIES problem is one I've seen before and have dealt with as you did here, simply by increasing the setting. I hadn't seen it on a "production" system, however, only on in-house ones that I thought might be configured in a peculiar way. I'll make a note to look into that further, because it's obviously not something we want users to have to deal with.

And as Brad noted, we'll work on reproducing and characterizing the performance issue this week.


Hi Both,

Thank you - Archer2 is in pre-production (only 4 cabinets are up), so it's possible that not everything is set up properly.
If you have trouble reproducing, I'm sure we can get you onto Archer2 itself one way or another.


Hello Istvan --

We've been able to get stencil-opt to scale on a very small EX system in-house, not so much from 1 to 2 nodes, but definitely from 2 to 4 nodes. Could you do some runs for us to compare with? We'd like it to be compiled like this:

chpl --fast --set order="sqrt(16e9*numLocales/8):int" test/studies/prk/Stencil/optimized/stencil-opt.chpl

and then run on 1, 2, and 4 nodes with basically the same settings as before, plus one more:

for nl in 1 2 4 ; do  ./stencil-opt -nl $nl ; done
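If I'm reading the --set order expression right, it sizes the grid so each locale holds roughly 16 GB of 8-byte real(64) data. A quick check of the per-node-count orders it produces (awk used purely for the arithmetic; note the nl=2 value, 63245, matches the fixed size used for the strong-scaling runs later in the thread):

```shell
# order = sqrt(16e9 * numLocales / 8): ~16 GB of real(64) per locale.
for nl in 1 2 4; do
  awk -v nl="$nl" 'BEGIN { printf "nl=%d  order=%d\n", nl, sqrt(16e9 * nl / 8) }'
done
```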

thanks in advance,


Hello Greg,

This sort of weak scaling does seem to work okay at 1-2-4-8 nodes, and even to some extent for strong scaling:
Weak scaling:
(1) 32.4667 - (2) 48.1904 - (4) 45.4153 - (8) 44.339
Fixing order to 63245
(1) 64.7435 - (2) 43.0197 - (4) 24.3892 - (8) 16.1413

I am not comparing this against an MPI implementation.


Hello Istvan --

Are those values you showed times (in seconds), perhaps from a modified version of stencil-opt.chpl? I'm not seeing anything like that in output from our version of that program in the repository. When I run I see reported MFlops/s numbers of (1) 139 - (2) 180 - (4) 342, and reported "Avg time" values of under 1s for all runs. If I just time the runs with time ./stencil-opt ... I see times in the range 7s-15s.

(I can't go over 4 nodes because the system I'm using only has 4 Rome nodes.)


Ah right, this is timing the "Main loop" over 100 iterations, which I inserted separately.

Hi @reguly / all —

Just to summarize where I think we are:

  • the cause of the original poor scaling results was the use of the program's default small / toy problem size, which isn't large enough to amortize overheads as compute nodes are added. Once a larger problem size was used, scaling improved (steady-ish performance for weak scaling, faster times for strong scaling)
  • it's still early days for Chapel on this system, and there are still improvements we could/should make. As I understand it, a major pain point is that we don't get good NUMA-sensitive behavior because of the way memory is registered with the network (which I believe is the cause of the fall-off from 1 to 2 nodes in the weak scaling experiment, as well as a reason that multi-node performance isn't closer to where one might expect). This is a challenge we have on other systems as well, but this is probably the highest-profile system on which we've run into it.

Does that accurately summarize where we are, and Istvan, do you feel as though your original query has been addressed?


Hi @bradcray,

Yes, I think that summarizes it fairly well so far. But, see below for some comparisons with MPI.

Below are what I think are important further data points for comparison, using the MPI version from the Parallel Research Kernels. These use the star-shaped stencil (vs. compact in the Chapel tests). Values are avg time × 100 (as I ran 100 iterations with both versions):
weak scaling:
(1) 27.1 - (2) 26.9 - (4) 27.18 - (8) 27.19
strong scaling with 63245 size
(1) 54 - (2) 26.9 - (4) 13.58 - (8) 6.8

Additionally, as to the small/toy problem size (8192) with pure MPI:
On 1 node, I get 0.83 sec, compared to 0.94 with Chapel, which is in line with the above. However, on 2 nodes, I get 0.2 (yes, superlinear scaling) vs. Chapel's 3.5; on 4 nodes 0.12, and on 8 nodes 0.063.

So, for the really large arrays, I think being within a factor of 2 is reasonable for a system that Chapel hasn't yet been optimized for. On the small problem I do not know whether this much larger difference is "expected" -- I don't know enough about the underlying communication mechanisms, which appear to be the key bottleneck here.



For systems we've fully tuned for (Cray XC with Aries and, recently, plain InfiniBand), we're competitive with MPI for large and modest problem sizes. Results for XC are available at Chapel: Performance Highlights: PRK Stencil. Even on Aries I'd still expect us to fall behind for really small problem sizes, since that's not something we've optimized too heavily for, but not by the amount you're seeing on Slingshot. Given that you're seeing superlinear scaling for the toy problem size, it's probably small enough to fit in cache at 2 nodes, so I agree that any comm overhead would be more pronounced.
