Chapel performance test on WSL Ubuntu

Jade,
I mentioned that when you run on a 72-logical-processor machine with Fortran built without the /Qparallel option, Chapel is only about 2 times faster. Shouldn't it be at least 10 times faster? Am I missing something here? To check, I did a Fortran build on my machine without /Qparallel and got an execution time of 12.5 sec (instead of 2.1 sec). That makes sense to me, since my machine has 6 cores and 12 logical processors.

Igor

All of the Fortran timings I did at 32 logical cores and 72 logical cores were done with -parallel, using the bigger problem size. Without -parallel, it took longer than I was willing to wait.

-Jade

Jade,
I installed the Intel Fortran compiler on Ubuntu and ran the same drag.f90 test. It has an 11.3 sec execution time, close enough to what I get running under Windows, even slightly faster. I tried the -parallel option and it did not change anything, so it looks like it is either not working or I don't know how to make it work on Ubuntu.

Important conclusion: under Windows and Ubuntu, sequential runs take the same time.

Do you still need the assembler code? I have not done this for ages; I will run your commands for the Chapel-generated executable first.

Thanks,
Igor

Jade,
Please see attached.

Igor
dragasm.txt (380.9 KB)
dragfull.txt (383.9 KB)


Nothing immediately jumps out at me as wrong from the assembly, but it's hard to know what's different without the Fortran version to compare it to. Can you please also provide that?

I continue to be unable to replicate any performance difference. @arezaii also ran this on a Windows machine with WSL and likewise could not replicate it.

-Jade


Hi Igor,

As Jade said, I have been trying to reproduce your performance results on my WSL machine, but I have not been able to reproduce the Fortran speedup you reported.

First, I installed the Intel oneapi Fortran compiler in both Windows and WSL from the following link: https://www.intel.com/content/www/us/en/developer/tools/oneapi/fortran-compiler-download.html

I used the base Windows OS (Windows 10 Pro) and got the following results:

Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.1.0 Build 20250317
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.43.34810.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:dragf.exe
-subsystem:console
dragf.obj

C:\Users\Ahmad\git\fortran>dragf.exe
 RFU(1), RFU(nn):  -0.183924604199142      -0.199632208309082
 Wall clock time:   10.2740001678467      seconds

Moving to WSL (Ubuntu 22.04.3 LTS), the results are similar:

[ahmad@HECTOR ~]$ ifx dragf.f90 -o dragf -O3
[ahmad@HECTOR ~]$ ./dragf
 RFU(1), RFU(nn):  -0.183924604199142      -0.199632208309082
 Wall clock time:   10.0947742462158      seconds
[ahmad@HECTOR ~]$ ifx --version
ifx (IFX) 2025.1.0 20250317
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.

Doing the experiment with Chapel in WSL:

[ahmad@HECTOR chapel()]$ chpl --fast drag.chpl
[ahmad@HECTOR chapel()]$ ./drag --n=50000 --ns=2000
Execution Time = 2.47882

[ahmad@HECTOR chapel()]$ chpl --version
chpl version 2.4.0
  built with LLVM version 14.0.0
  available LLVM targets: m68k, xcore, x86-64, x86, wasm64, wasm32, ve, systemz, sparcel, sparcv9, sparc, riscv64, riscv32, ppc64le, ppc64, ppc32le, ppc32, nvptx64, nvptx, msp430, mips64el, mips64, mipsel, mips, lanai, hexagon, bpfeb, bpfel, bpf, avr, thumbeb, thumb, armeb, arm, amdgcn, r600, aarch64_32, aarch64_be, aarch64, arm64_32, arm64
Copyright 2020-2025 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)

My CPU: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz

I have an 8-core/16-thread Intel 10700K and found that I could squeeze slightly more performance out of the original Chapel code by setting CHPL_RT_NUM_THREADS_PER_LOCALE=16, consistent with your finding that timings improved when this variable is set to 2x the number of cores on processors that support hyper-threading.

[ahmad@HECTOR chapel()]$ CHPL_RT_NUM_THREADS_PER_LOCALE=16 ./drag --n=50000 --ns=2000
Execution Time = 1.68287

I'm curious why I cannot replicate your Fortran results on my machine. Since I don't normally do any Fortran compilation, I tried to reconstruct a working command from your pasted build log, but did not see any improvement from using /Qparallel (or the other options I tried) in Windows.

C:\Users\Ahmad\git\fortran>ifx dragf.f90 -O3 /Qparallel -o dragf
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.1.0 Build 20250317
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.43.34810.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:dragf.exe
-subsystem:console
dragf.obj

C:\Users\Ahmad\git\fortran>dragf.exe
 RFU(1), RFU(nn):  -0.183924604199142      -0.199632208309082
 Wall clock time:   10.1850004196167      seconds

I installed the Intel Fortran compiler on Ubuntu and ran the same drag.f90 test. It has an 11.3 sec execution time, close enough to what I get running under Windows, even slightly faster.
...
Important conclusion: under Windows and Ubuntu, sequential runs take the same time.

Can you provide the minimal set of compiler options that reproduces the Fortran speedup you're seeing when compiling in Windows?


Jade,
Please see attached. Let me know if that is good enough.

Thanks,
Igor

(attachments)

drag-f90-fromexe.txt (105 KB)

Ahmad,
When building under Windows you need to use the /Qparallel option; that makes a dramatic change in my run, making it almost 2 times faster than the Chapel run. On Ubuntu I was not able to get this 'parallel' option working: it does not tell me that anything is wrong, but I get about the same execution time.

Thanks,
Igor

@iarshavsky

Isn't that what Ahmad shows on the first line of his Fortran transcript?

Or are you suggesting something else is missing?

-Brad

Thank you for pointing this out; I did not see that line. I am doing my build from MSVS 2015 and using Intel Fortran 2019, and I attached the build log; I'm not sure what else I can do. I also attached the disassembler output. Thanks!

To quote myself:

I'm curious why I cannot replicate your Fortran results on my machine.

I think this may be due to a difference in compilers. It seems there is no automatic parallelization feature in the new ifx compiler like there was in ifort. See -parallel at https://www.intel.com/content/www/us/en/developer/articles/guide/porting-guide-for-ifort-to-ifx.html

With ifort the -parallel compiler option auto-parallelization is enabled. That is not true for ifx; there is no auto-parallelization feature with ifx.

So this probably explains why I can't reproduce your results with ifx. It may also explain the speed difference you see between WSL and native Windows, since you're presumably using ifx in WSL and ifort in Windows.

@iarshavsky, thanks for providing that disassembly! I'm not certain it is the disassembly of the right kernel; it seems to contain a lot of data and not a lot of code (but I am also not that familiar with the PE format). Regardless, it did contain a few key pieces of information that have allowed me to reproduce the issue and come up with a few solutions.

Intel compilers by default try to link in SVML, which Chapel does not. This shows up in your assembly. In my own disassembly, I saw some SVML calls, but not a lot of vector code: I was missing -march=native. Adding that to the Fortran compilation, I can reproduce the performance gap, with Fortran now matching the Chapel performance with 12 threads. So the full compilation command was ifort -O3 -parallel -march=native drag.f90 -o dragf.
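
As an optional sanity check (my own suggestion, not something from the original posts, and assuming GNU binutils are available on the Linux/WSL side), you can grep a binary for SVML references:

objdump -d ./dragf | grep -i svml | head   # __svml_* calls in the disassembly indicate SVML vectorization
nm ./dragf | grep -i svml                  # undefined __svml_* symbols show the link-time SVML dependency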

However, with a little bit of heroics I can also auto-vectorize Chapel code with SVML. It's unfortunately not possible yet with the LLVM backend, but by switching to the C backend I could do this. Here are the commands I used (which you, Igor, can also use on your machine to get similar results):

(cd $CHPL_HOME && export CHPL_TARGET_COMPILER=clang && nice make all -j`nproc` && unset CHPL_TARGET_COMPILER)
chpl --fast drag.chpl -o drag_llvm --target-compiler=llvm # use the LLVM backend
chpl --fast drag.chpl -o drag_c --target-compiler=clang # use the C backend
chpl --fast drag.chpl -o drag_c_svml --target-compiler=clang --no-ieee-float --ccflags -fveclib=SVML -L/path/to/intel/SVML/libraries -lsvml # use the C backend with SVML

Note that the version with SVML requires the Chapel flag --no-ieee-float, which among other things turns on -ffast-math for clang. Without it, Chapel/Clang/LLVM will not auto-optimize to SVML (because you must opt in to lower floating-point precision).
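
Roughly speaking, the C backend ends up handing clang something along the lines of the following (a simplified, hypothetical approximation on my part, with drag.c standing in for the generated C file, not the exact command Chapel produces):

clang -O3 -march=native -ffast-math -fveclib=SVML drag.c -L/path/to/intel/SVML/libraries -lsvml

which is why -fveclib=SVML only kicks in once the fast-math-style relaxations are allowed.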

Running the SVML version results in a nice 1.3x speedup.

The last thing is the difference between the cores on my machine and the cores on your machine. When you run with 12 threads, you are using all of your available logical cores; when I run with 12 threads, I am using 12 physical cores. This is going to make a big difference.
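
(For reference, and as an addition of mine rather than part of the original discussion: on Linux or WSL you can see the physical/logical split with lscpu.)

lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'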

Summary of timings

I compared these versions with N=100_000 and NS=5000:

  • ifort -O3 -parallel -march=native drag.f90 -o dragf

  • chpl --fast drag.chpl -o drag_llvm --target-compiler=llvm

  • chpl --fast drag.chpl -o drag_c_svml --target-compiler=clang --no-ieee-float --ccflags -fveclib=SVML -L/path/to/intel/SVML/libraries -lsvml

  • chpl --fast drag2.chpl -o drag2_llvm --target-compiler=llvm

  • chpl --fast drag2.chpl -o drag2_c_svml --target-compiler=clang --no-ieee-float --ccflags -fveclib=SVML -L/path/to/intel/SVML/libraries -lsvml

  • ./dragf

    • Default: 3.8s
    • OMP_NUM_THREADS=12: 13.5s
    • OMP_NUM_THREADS=36: 5.1s
    • OMP_NUM_THREADS=72: 3.8s
  • ./drag_llvm

    • Default: 6.3s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=12: 13.3s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=36: 6.3s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=72: 6.0s
  • ./drag_c_svml

    • Default: 5.4s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=12: 10.9s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=36: 5.4s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=72: 5.0s
  • ./drag2_llvm

    • Default: 4.6s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=12: 11.8s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=36: 4.6s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=72: 3.8s
  • ./drag2_c_svml

    • Default: 1.9s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=12: 4.2s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=36: 1.9s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=72: 1.4s

So here is my best guess at the issues causing the performance gap for you:

  • Intel's Fortran thread pool implementation does a much better job at parallelizing with hyperthreads/logical cores. All other factors being the same, when both Intel Fortran and Chapel use only the physical cores the performance is the same, but Intel Fortran has an edge with hyperthreads.
  • By default, Chapel will not vectorize with SVML like ifort will. Turning this on brings Chapel to parity with Intel Fortran.
  • Using an explicit forall loop (drag2) gives Chapel a 2x speedup no matter what (see the sketch just after this list).
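
To make that last bullet concrete, here is a minimal sketch of the two styles; this is an illustration I'm adding, not the actual drag.chpl/drag2.chpl kernels (the array names and the exp call are placeholders):

config const n = 50_000;
var A, B, C: [1..n] real;

// Promoted whole-array statement: the data parallelism is implicit.
C = A * exp(B);

// Equivalent explicit forall loop: same result, but the parallel
// iteration structure is spelled out.
forall i in 1..n do
  C[i] = A[i] * exp(B[i]);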

I've attached the three source files I used here
drag.chpl (1.7 KB)
drag.f90.txt (2.6 KB)
drag2.chpl (1.9 KB)

Hope this helps!
-Jade


Interesting results! :smiley: I've never used SVML, so just curious about the speedup...

And I have one minor question: The results with OMP_NUM_THREADS=12 are always slower than the "Default" ones. For example, in the Fortran case,

  • ./dragf
    • Default: 3.8s <-----
    • OMP_NUM_THREADS=12: 13.5s <-----
    • OMP_NUM_THREADS=36: 5.1s
    • OMP_NUM_THREADS=72: 3.8s

Does this mean that the "Default" result uses all physical cores (rather than a serial calculation)? (It seems so, because the "Default" result is close to the one obtained with NUM_THREADS=72, but I'd like to confirm.)

(Also, I wonder if the following lines in the above post may be a typo of drag --> drag2...?)

chpl --fast drag2.chpl -o drag2_llvm ... (?)
chpl --fast drag2.chpl -o drag2_c_svml ... (?)
Last calculation --> ./drag2_c_svml (?)

Jade,
Thanks so much!!!
Let me ask, to make sure I understand correctly: Fortran with 12 threads gives 13.5 sec, and drag2_c_svml gives 4.2 sec. That is great, if I got it right. I don't understand why going from 36 to 72 threads (last run above) does not change the timing that much; I guess the limited number of cores is causing that, right?

Let me make my runs and I will let you know.

Thanks again,
Igor


I've never used SVML, so just curious about the speedup...

Note that I also got similarly good results with Chapel using libmvec, compiling with --ccflags -fveclib=libmvec.

And I have one minor question: The results with OMP_NUM_THREADS=12 are always slower than the "Default" ones. For example, in the Fortran case,

Correct. These numbers were collected on a 36-core (72-hyperthread) machine. So by setting OMP_NUM_THREADS=12, I am forcing the Fortran code to use only 12 cores. This was motivated by @iarshavsky's machine having 12 hyperthreads.

By default, Fortran uses the number of hyperthreads (on this machine, 72), while Chapel defaults to the number of physical cores (on this machine, 36).

(Also, I wonder if the following lines in the above post may be a typo of drag --> drag2...?)

Yes, there was; I have edited the post and fixed it.


Also note that this does actually work with the LLVM backend today; it just requires slightly different flags:

chpl --fast drag.chpl -o drag_llvm_svml --target-compiler=llvm --no-ieee-float --mllvm -vector-library=SVML -L/path/to/intel/SVML/libraries -lsvml

Chapel typically does not get that much extra performance from using hyperthreads. 36 is the number of physical cores; 72 is the number of hyperthreads. I will note that Fortran benefits a lot more from hyperthreads than Chapel does. This is just a factor of the different threading models Fortran and Chapel use. I don't know the threading model of Intel Fortran, but Chapel generally binds 1 pthread per core and then runs lightweight user-level qthreads tasks on top of those. In the hyperthreading case, I believe Chapel binds 1 pthread per logical core (although I am not an expert on how the threads in Chapel get assigned in this case).

I completely agree. If Discourse supported a reaction stronger than a heart, @jabraham's post would've gotten it from me. I find myself thinking there could be an interesting "coding war stories" blog article in here somewhere…

One thing I wanted to note that I don't believe has come up on this thread is that if a Chapel user doesn't want to worry about how many logical/physical cores their system has, they can set CHPL_RT_NUM_THREADS_PER_LOCALE to one of the symbolic values MAX_PHYSICAL (number of physical cores) or MAX_LOGICAL (number of logical cores or hyperthreads), which will cause the runtime to fill in the value for you. E.g., Jade could have used this rather than the hard-coded values of 36 and 72 above (Docs for this feature).
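
For example, reusing the drag binary and problem size from earlier in the thread:

CHPL_RT_NUM_THREADS_PER_LOCALE=MAX_LOGICAL ./drag --n=50000 --ns=2000    # use all hyperthreads
CHPL_RT_NUM_THREADS_PER_LOCALE=MAX_PHYSICAL ./drag --n=50000 --ns=2000   # the current default: physical cores only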

I'll also note that our choice to have Chapel default to MAX_PHYSICAL was made some time ago and may be overdue for re-evaluation. I've just opened Is defaulting the number of threads to #cores vs. #hyperthreads still the right choice? (MAX_PHYSICAL vs. MAX_LOGICAL) · Issue #27099 · chapel-lang/chapel · GitHub to capture this thought.

This matches my understanding, where the tasks are mapped to the pthreads via qthreads using the same strategy as in the MAX_PHYSICAL case (which I think of as being "generally round-robin with some reset heuristics around parallel loops that are likely to use all cores").

Thanks Jade,
-Brad


Yes, that is great!!!! Chapel beats Intel Fortran in my test: 1.1 sec vs 2.1 sec (n=50000, ns=2000).

So, I believe the issue is resolved and I will keep moving forward. I assume that Chapel will be able to do both CPU and GPU runs with the same make.

Thanks a lot,
Igor


P.S. It would also be nice to be able to get the vectorization speedup from whole-array operations without needing an explicit 'forall'.

1 Like