On my M1 Mac mini I see Chapel programs run with about 400% CPU. It has 4 performance and 4 efficiency cores, and Chapel seems to use only 4 threads at most. Is this correct?
On a computer with 40 cores, I see CPU at 4000%, so it seems to be using all 40 cores as expected.
When I set
--dataParTasksPerLocale 6, it shows 1200% CPU (I was expecting 600%)
--dataParTasksPerLocale 8, it shows 1400% CPU (I was expecting 800%)
Could you tell me why it behaves like this?
On an M-series Mac (or any hybrid architecture), Chapel will by default only use the performance cores, not the efficiency cores.
To change the binding Chapel uses, set the environment variable CHPL_RT_USE_PU_KIND to performance, efficiency, or all; on your Mac, all will use all the cores. Chapel defaults to the performance cores to avoid load imbalance between the two kinds of core.
Note: the Chapel website seems to be having some issues right now, but you can find the docs for these variables there
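If it helps to see what the runtime has actually picked up, here is a minimal sketch using the standard here.numPUs and here.maxTaskPar locale queries (nothing here is specific to your program); run it with and without CHPL_RT_USE_PU_KIND=all and the reported values should change accordingly:

// report the processing units and task parallelism the runtime sees
writeln("physical PUs accessible: ", here.numPUs(logical=false, accessible=true));
writeln("logical PUs accessible:  ", here.numPUs(logical=true, accessible=true));
writeln("max task parallelism:    ", here.maxTaskPar);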
Thank you. Regarding the other issue I mentioned, if I set CHPL_RT_NUM_THREADS_PER_LOCALE=4, I see about 400% CPU usage, but using --dataParTasksPerLocale=4 seems to do something else.
Hi @cpraveen —
The dataParTasksPerLocale option tells Chapel the maximum number of tasks to create per forall loop by default, but it does not cap how many tasks will be created altogether. As a result, if you set it to 4 and had a nested forall loop (for example), you could get 20 tasks created on a node that has more than 20 cores: 4 for the outer loop and 4 for each of the inner loops. In addition, with some choices of CHPL_TASKS we create a thread per core, and an idle thread may spin looking for new tasks/work to run, which could drive your utilization above 400% even in a single forall loop. Both of these are reasons you could see greater than 400% utilization with a setting of 4.
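To make that concrete, here is a minimal sketch (the variable names and sizes are made up, not taken from your code) of a nested forall where the task counts multiply as described:

config const m = 4, n = 1_000_000;
var a: [1..m, 1..n] real;

// with --dataParTasksPerLocale=4: up to 4 tasks for the outer loop,
// plus up to 4 more for each inner loop, roughly 20 tasks in all
forall i in 1..m {
  forall j in 1..n do
    a[i, j] = i*n + j;
}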
In contrast, CHPL_RT_NUM_THREADS_PER_LOCALE puts a hard cap on the number of threads the runtime will use to run tasks, which is why you never see that go above 400%.
In general, my preference is not to set either of these flags, as I don't think we've spent a lot of time optimizing Chapel's behavior with them. For example, CHPL_RT_NUM_THREADS_PER_LOCALE raises questions for me like "Where will those threads run?" and "On a node with multiple sockets, will they be clumped on one socket or spread between them?" (and which of those two behaviors would I want in a given program?). For those reasons, I'm always curious what leads users to reach for these flags, and I'd be curious what led you to try them out?
Thanks,
-Brad
I had nested forall loops in my code, which should explain what I was observing.
Is it a good idea to use many cores when the problem size is not very big? This raises the question of how to choose the number of threads to use.
I am testing a 2d PDE solver from here
on a 128x128 mesh.
Here are wall clock timings on Linux and on a Mac.
If I use too many cores on the Linux machine, e.g., all 40 cores, it actually runs slower.
The Apple M1 with just 4 cores is a lot faster!
Debian 12 with dual CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Chapel installed with the .deb file from the Chapel website
4 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=4 ./vte --Re 1000 > log.txt
real 6m10.832s
user 24m42.946s
sys 0m0.077s
6 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=6 ./vte --Re 1000 > log.txt
real 5m19.267s
user 31m55.047s
sys 0m0.085s
8 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=8 ./vte --Re 1000 > log.txt
real 4m59.711s
user 39m56.926s
sys 0m0.109s
12 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=12 ./vte --Re 1000 > log.txt
real 5m1.080s
user 60m11.685s
sys 0m0.125s
40 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=40 ./vte --Re 1000 > log.txt
real 11m58.530s
user 478m53.838s
sys 0m0.337s
Apple Mac mini M1, 4 cores
Chapel installed via homebrew
$ time ./vte --Re 1000 > log.txt
./vte --Re 1000 > log.txt 723.09s user 0.60s system 398% cpu 3:01.76 total
I think there are a few factors at play here.
First of all, yes: if the problem size is too small, using too many cores will slow down performance, because you end up spending more time doing thread bookkeeping than actual work. CHPL_RT_NUM_THREADS_PER_LOCALE is the best way to control that, but it's not a fine-grained tool.
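One coarse way to express that tradeoff in the code itself is Chapel's serial statement, which suppresses task creation whenever its condition holds. It is only an on/off switch rather than a way to pick a task count, so it doesn't address the per-loop control discussed next; a minimal sketch with a made-up threshold:

config const n = 128,
             parThreshold = 64*64;  // hypothetical cutoff below which tasks aren't worth it

const D = {1..n, 1..n};
var u: [D] real;

// when the domain is small, the forall runs serially with no task overhead
serial (D.size < parThreshold) do
  forall (i, j) in D do
    u[i, j] = i + j;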
Being able to control the parallelism on a per-loop basis is a design challenge we've wrestled with. GitHub issue #23741 in chapel-lang/chapel ("Controlling `dataPar*` settings on a per-loop instead of per-distribution basis") describes a similar problem for distributed domains, but more generally, having more control over the parallelism of a forall loop is desirable. There is an internal issue about this (#5261, for those who can see it) which describes it really well; the idea is to have a configuration variable available for loops that want it.
But that's more future design work than something that can help today. I notice your code uses the Stencil distribution, so you might be able to set dataParTasksPerLocale on the Stencil distribution and control the parallelism that way.
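For example, here is a minimal sketch of that last suggestion. It assumes the Stencil distribution's initializer accepts a dataParTasksPerLocale argument the way blockDist's does, so double-check the StencilDist docs for your Chapel version; the names and values are illustrative:

use StencilDist;

config const n = 128,
             tasksPerLoop = 4;   // illustrative per-locale task cap

const Space = {1..n, 1..n};
const Dist = new stencilDist(boundingBox=Space, fluff=(1,1),
                             dataParTasksPerLocale=tasksPerLoop);
const D = Space dmapped Dist;
var u: [D] real;

// foralls over D (and over arrays declared on it) should now use at most
// tasksPerLoop tasks per locale, independent of the machine's core count
forall (i, j) in D do
  u[i, j] = 0.0;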
As for comparing your deb install on an x86 system to a brew install on an M1 system, it's a bit of an unfair comparison.
Lastly, you may be interested in giving https://chapel-lang.org/docs/technotes/optimization.html a read.
-Jade