On my M1 Mac mini I see Chapel programs run with about 400% CPU. It has 4 performance and 4 efficiency cores, and Chapel seems to use only 4 threads at most. Is this correct?
On a computer with 40 cores, I see CPU at 4000%, so it seems to be using all 40 cores as expected.
When I set
--dataParTasksPerLocale 6, it shows 1200% CPU (I was expecting 600%)
--dataParTasksPerLocale 8, it shows 1400% CPU (I was expecting 800%)
Could you tell me why it behaves like this?
On an M-series Mac (or any hybrid architecture), Chapel will by default only use the performance cores, not the efficiency cores.
To change the binding Chapel uses, set the environment variable CHPL_RT_USE_PU_KIND to performance, efficiency, or all; on your Mac, all will use all the cores. Chapel defaults to the performance cores to avoid load imbalance between the two kinds of core.
Note: the Chapel website seems to be having some issues right now, but you can find the docs for these variables there
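If it helps to see what the runtime has actually picked up, here is a minimal sketch using the standard here.numPUs and here.maxTaskPar locale queries (nothing here is specific to your program); run it with and without CHPL_RT_USE_PU_KIND=all and the reported values should change accordingly:

// report the processing units and task parallelism the runtime sees
writeln("physical PUs accessible: ", here.numPUs(logical=false, accessible=true));
writeln("logical PUs accessible:  ", here.numPUs(logical=true, accessible=true));
writeln("max task parallelism:    ", here.maxTaskPar);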
Thank you. Regarding the other issue I mentioned, if I set CHPL_RT_NUM_THREADS_PER_LOCALE=4, I see about 400% CPU usage, but using --dataParTasksPerLocale=4 seems to do something else.
Hi @cpraveen —
The dataParTasksPerLocale option tells Chapel the maximum number of tasks to create per forall loop by default, but it does not cap how many tasks will be created altogether. As a result, if you set it to 4 and had a nested forall loop (for example), you could get 20 tasks created on a node that has more than 20 cores: 4 for the outer loop and 4 for each of the inner loops. In addition, with some choices of CHPL_TASKS we create a thread per core, and an idle thread may spin looking for new tasks/work to run, which could drive your utilization above 400% even in a single forall loop. Both of these are reasons you could see greater than 400% utilization with a setting of 4.
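To make that concrete, here is a minimal sketch (the variable names and sizes are made up, not taken from your code) of a nested forall where the task counts multiply as described:

config const m = 4, n = 1_000_000;
var a: [1..m, 1..n] real;

// with --dataParTasksPerLocale=4: up to 4 tasks for the outer loop,
// plus up to 4 more for each inner loop, roughly 20 tasks in all
forall i in 1..m {
  forall j in 1..n do
    a[i, j] = i*n + j;
}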
In contrast, CHPL_RT_NUM_THREADS_PER_LOCALE puts a hard cap on the number of threads the runtime will use to run tasks, which is why you never see that go above 400%.
In general, my preference is not to set either of these flags, as I don't think we've spent a lot of time optimizing Chapel's behavior with them. For example, CHPL_RT_NUM_THREADS_PER_LOCALE raises questions for me like "Where will those threads run?" and "On a node with multiple sockets, will they be clumped on one socket or spread between them?" (and which of those two behaviors would I want in a given program?). For those reasons, I'm always curious what leads users to reach for these flags, and I'd be curious what led you to try them out?
Thanks,
-Brad
I had nested forall loops in my code, which should explain what I was observing.
Is it a good idea to use many cores when the problem size is not very big? This raises the question of how to choose the number of threads to use.
I am testing a 2d PDE solver from here
on a 128x128 mesh.
Here are wall clock timings on Linux and on a Mac.
If I use too many cores on the Linux machine, e.g., all 40 cores, it actually runs slower.
The Apple M1 with just 4 cores is a lot faster!
Debian 12 with dual CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Chapel installed with the .deb file from the Chapel website
4 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=4 ./vte --Re 1000 > log.txt
real 6m10.832s
user 24m42.946s
sys 0m0.077s
6 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=6 ./vte --Re 1000 > log.txt
real 5m19.267s
user 31m55.047s
sys 0m0.085s
8 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=8 ./vte --Re 1000 > log.txt
real 4m59.711s
user 39m56.926s
sys 0m0.109s
12 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=12 ./vte --Re 1000 > log.txt
real 5m1.080s
user 60m11.685s
sys 0m0.125s
40 cores
$ time CHPL_RT_NUM_THREADS_PER_LOCALE=40 ./vte --Re 1000 > log.txt
real 11m58.530s
user 478m53.838s
sys 0m0.337s
Apple Mac mini M1, 4 cores
Chapel installed via homebrew
$ time ./vte --Re 1000 > log.txt
./vte --Re 1000 > log.txt 723.09s user 0.60s system 398% cpu 3:01.76 total
I think there are a few factors at play here.
First of all, yes: if the problem size is too small, using too many cores will slow down performance, because you end up spending more time doing thread bookkeeping than actual work. CHPL_RT_NUM_THREADS_PER_LOCALE is the best way to control that, but it's not a fine-grained tool.
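One coarse way to express that tradeoff in the code itself is Chapel's serial statement, which suppresses task creation whenever its condition holds. It is only an on/off switch rather than a way to pick a task count, so it doesn't address the per-loop control discussed next; a minimal sketch with a made-up threshold:

config const n = 128,
             parThreshold = 64*64;  // hypothetical cutoff below which tasks aren't worth it

const D = {1..n, 1..n};
var u: [D] real;

// when the domain is small, the forall runs serially with no task overhead
serial (D.size < parThreshold) do
  forall (i, j) in D do
    u[i, j] = i + j;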
Being able to control the parallelism on a per-loop basis is a design challenge we've wrestled with. GitHub issue #23741 in chapel-lang/chapel ("Controlling `dataPar*` settings on a per-loop instead of per-distribution basis") describes a similar problem for distributed domains, but more generally, having more control over the parallelism of a forall loop is desirable. There is an internal issue about this (#5261, for those who can see it) which describes it really well; the idea is to have a configuration variable available for loops that want it.
But that's more future design work than something that can help today. I notice your code uses the Stencil distribution, so you might be able to set dataParTasksPerLocale on the Stencil distribution and control the parallelism that way.
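For example, here is a minimal sketch of that last suggestion. It assumes the Stencil distribution's initializer accepts a dataParTasksPerLocale argument the way blockDist's does, so double-check the StencilDist docs for your Chapel version; the names and values are illustrative:

use StencilDist;

config const n = 128,
             tasksPerLoop = 4;   // illustrative per-locale task cap

const Space = {1..n, 1..n};
const Dist = new stencilDist(boundingBox=Space, fluff=(1,1),
                             dataParTasksPerLocale=tasksPerLoop);
const D = Space dmapped Dist;
var u: [D] real;

// foralls over D (and over arrays declared on it) should now use at most
// tasksPerLoop tasks per locale, independent of the machine's core count
forall (i, j) in D do
  u[i, j] = 0.0;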
As for comparing your deb install on an x86 system to a brew install on an M1 system, it's a bit of an unfair comparison.
Lastly, you may be interested in giving https://chapel-lang.org/docs/technotes/optimization.html a read.
-Jade