Chapel performance test on WSL Ubuntu

I am trying to test Chapel performance with a potential use in mind: nuclear engineering safety analysis codes (NUC) and real-time training simulation codes.

The motivation comes from the following rationale. NUC codes typically use just one processor; parallelization efforts are limited to coarse-grained parallelization using either MPI or proprietary tools, in which case the work is split into, typically, about 5 executables running in parallel that exchange and synchronize data.

My work laptop has 16 cores, so I anticipate that a few years from now commonly affordable desktops will have more than 64 cores. Effective utilization of those has the potential for a 100x performance boost, which sounds good and is promising from a business perspective (competitiveness, profit).

I installed Ubuntu 24.04.2 running under MS Windows 11 Pro on an Intel(R) Core(TM) i7-8850H CPU @ 2.60 GHz. The machine has 1 GPU, 6 cores, and 12 logical processors. Chapel was built with GPU support.

drag.chpl (1.6 KB)
dragg.chpl (2.5 KB)

In the uploaded tests I tried to use typical data and calculations (on a small scale, of course); the number of nodes is n=20000 and the number of steps is ns=1000.
By doing:
chpl --fast drag.chpl
./drag --n=20000 --ns=1000

I am getting an execution time of about 3 sec.
dragg (run on the GPU) gives an execution time of about 2 sec.
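
For context, here is a minimal sketch of how the configurable sizes and timing look in a Chapel test like this (drag.chpl itself is attached above; this sketch just assumes the standard Time module and Chapel 2.x's stopwatch, with the actual number crunching elided):

      use Time;

      config const n  = 20_000,   // number of nodes, overridable with --n
                   ns = 1_000;    // number of steps, overridable with --ns

      var t: stopwatch;
      t.start();
      // ... per-step array computations over 1..n, repeated ns times ...
      t.stop();
      writeln("Execution Time = ", t.elapsed());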

I got the best timing when

export CHPL_RT_NUM_THREADS_PER_LOCALE=12

that is, when the number of threads is set to the number of logical processors.
By changing the above export I see the time change, and it is also clear from Windows Task Manager that all of the machine's processors are in use.
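
As a side check, a minimal sketch (assuming a recent Chapel version) to confirm what the runtime itself sees on this machine:

      writeln("physical cores     : ", here.numPUs(logical=false));
      writeln("logical processors : ", here.numPUs(logical=true));
      // maxTaskPar should reflect CHPL_RT_NUM_THREADS_PER_LOCALE
      writeln("max parallel tasks : ", here.maxTaskPar);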

Since the legacy code I am aiming to parallelize is written in good old Fortran, I wrote a similar test (attached), which was compiled using Intel Fortran.
drag-f90.txt (2.9 KB)

Please note that I added random multipliers to prevent a 'too smart' compiler from optimizing the work away.

The execution time is about 0.4 sec.

./drag --n=50000 --ns=2000
gives an execution time of 16.2 sec
./dragg --n=50000 --ns=2000
gives an execution time of 5.7 sec

The Fortran execution time for the same sizes is 2.12 sec.

Any thoughts or clues???
Help would be greatly appreciated!!!


Hi Igor!

Your dragg.chpl runs in 2.8 seconds on my RTX A2000 with --n=50000 --ns=2000. I think the first thing to address is your use of whole-array operations. Everything in your number-crunching logic uses whole arrays instead of accessing them at given indices. That's perfectly fine; however, each such statement will be a separate kernel launch on the GPU. That results in more kernel-launch overhead and lower arithmetic intensity for each kernel. Given that all your arrays are the same size, can you change your logic to use a single, explicit forall:

      for ii in 1..ns do {

       VEL  = VEL_S;
       VD   = VD_S;
       RHOF = RHOF_S;
       RHOG = RHOG_S;
       VS   = VS_S;

       forall i in 1..n {

         RHO[i] = VD[i] * RHOG[i] + (1.0-VD[i]) * RHOF[i];

         //RHO_S = RHO;

         RE[i]  = RHO[i] * VEL[i] * DH[i] / VS[i];

         //RE_S = RE;

         XM[i] = VEL[i] * VEL[i] * (1.0 / (1.0 +(1.0/VD[i] - 1.0)));

         //XM_S = XM;

         RFB[i]  = XM[i]**0.97 * (FA[i] + FB[i]*RE[i]**(-FC[i]*0.97));
         RFB[i] += XM[i]**0.98 * (FA[i] + FB[i]*RE[i]**(-FC[i]*0.98));
         RFB[i] += XM[i]**0.99 * (FA[i] + FB[i]*RE[i]**(-FC[i]*0.99));

         //RFB_S = RFB;

         RFU[i]  = 1.0 /(-2.0 + log(RHK[i] + 2.5/(RE[i]**1.2) * (1.1 - 2.0 * log (RHK[i] + 21.25/RE[i]**0.90))));
         RFU[i] += (1.0-VD[i]) /(-2.1 + log(RHK[i] + 2.6/(RE[i]**1.3) * (1.2 - 2.1 * log (RHK[i] + 21.35/RE[i]**0.92))));
         RFU[i] += RFB[i] /(-2.2 + log(RHK[i] + 2.7/(RE[i]**1.4) * (1.3 - 2.2 * log (RHK[i] + 21.45/RE[i]**0.94))));


       }

       RFU_S = RFU;

       //writeln(RFU);
    } // for ns

This implementation cuts the execution time down to 1.8 seconds on my GPU. I don't think it sacrifices much readability, either. Besides, your Fortran code also indexes arrays directly, so if anything, this makes the codes more similar.

I also have 2 questions:

  1. Your array declarations for DH, RHK, and many others are within dragt, meaning that every time that function is called they will be reallocated and zeroed out. Given that you have logic for the first invocation of dragt, I don't think that's what you want (see the sketch after this list).
  2. Similarly, the assignments of VEL, VD, RHOF, RHOG, and VS from CPU arrays are within the for ii loop. In that loop, these don't change at all, so right now it looks like you are copying them redundantly. I am not great at Fortran, but I don't see those copies in the Fortran version. Just removing them doesn't make much of a difference on my system (maybe the compiler removes them anyway?), but I wanted to note it nonetheless.
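
For item 1, here is a hypothetical sketch of what I mean (the names dragt, DH, and RHK follow your attachment; everything else is illustrative, and with CHPL_GPU_MEM_STRATEGY=array_on_device the declarations may still need to live inside an on block):

      config const n = 20_000;
      const D = {1..n};

      // allocated (and zero-initialized) once, not on every call to dragt
      var DH, RHK: [D] real;

      proc dragt() {
        // ... use DH and RHK directly; no per-call reallocation ...
      }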

Engin

Engin,

Let me try to answer your questions:

  1. The arrays you are talking about should be on the GPU, so I put the declarations
    under 'on loc'. Is there a better option? I tried '=noinit', but that did
    not change anything.
  2. All those arrays had to be moved from host to GPU on every step to get
    realistic time measurements.

I tried to replace the array operations with 'forall' and did not get any
improvement, so I removed it, since the code looks nicer that way and the
compiler is expected to be doing that anyway.

When running the executable built with Intel Fortran I am getting a 2-second
execution time, and that matches your GPU time.
Can you run the following (more realistic) test on your machine:

./dragg --n=20000 --ns=1000

I have an execution time of 1.8 sec.

The major question/concern remains: when running on the CPU (./drag --n=20000
--ns=1000) the execution time of Intel Fortran is much smaller than Chapel's
(0.4 vs 1.8). Is it a Chapel build problem, or is it something still to be
worked out on the Ubuntu platform (and maybe some others)? I assume you folks
have checked Chapel vs C/C++ performance on Cray systems and such.

Also, I don't understand why with a smaller array size the GPU run becomes
much closer to the CPU run. If the array size is 10000 or 20000 the GPU does
not give any advantage; why is that?

Thanks a lot,

Igor

I get 0.59 seconds with a single forall and 1.08 seconds with the original implementation.


I want to understand the end goal a little bit better here. Do you want to improve your CPU performance?

Comparing CPU and GPU performance against one another is difficult at times; if that's what you want, though, it feels like a separate topic.

Engin

I also just ran a quick test on your CPU implementation. Are you using CHPL_LOCALE_MODEL=gpu while doing that test?

Your CPU-only implementation takes 2.64 seconds with that environment variable set and 0.68 seconds without it. Another way to get better performance while keeping GPU support enabled is to remove the on loc.

This is a performance issue we are aware of, but we haven't been able to prioritize the work for it so far. If you have GPU support enabled and use an on statement, the compiler cannot guarantee whether things will execute or be allocated on the GPU. Therefore, it generates code that can handle all eventualities. If you know you are not going to use the GPU, dropping the on statement, or simply using a non-GPU-enabled Chapel, can give you a couple of times faster performance.
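
To illustrate, here is a minimal sketch (assuming a GPU-enabled build with a GPU visible to the runtime; the loop bodies are just stand-ins):

      config const n = 20_000;

      // with an `on`, the loop is compiled for and launched on the GPU
      on here.gpus[0] {
        var A: [1..n] real;
        forall i in 1..n do A[i] = sqrt(i:real);
      }

      // without an `on`, the same forall runs across the CPU cores and avoids
      // the overhead described above
      var B: [1..n] real;
      forall i in 1..n do B[i] = sqrt(i:real);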

Also note that you can have two Chapel builds side by side when it comes to GPU support, e.g.:

> source util/setchplenv.bash
> make -j12  # build a non-gpu enabled Chapel
> export CHPL_LOCALE_MODEL=gpu
> make -j12  # build the GPU enabled version

> chpl hello.chpl # since you have `CHPL_LOCALE_MODEL=gpu`, this executable can use GPUs
> unset CHPL_LOCALE_MODEL
> chpl hello.chpl # without rebuilding Chapel, this executable will not be able to use GPUs.

Hope this helps,
Engin


Engin,
Thanks so much !
Yes, unset CHPL_LOCALE_MODEL makes it run twice as fast.
And I see your GPU is twice as fast as mine.
And you are right, forall works better; I hope that will be improved down the
road so the compiler does this and we can keep the nice vector notation.
Still, the GPU helps only when the array size is large enough (such as 50000).
If that is truly the case, then in our applications GPU use would only be a
very exceptional case.

The major problem/concern I have is: why is the Chapel CPU run (drag) at least
3 times slower than the Intel Fortran code, which does the same work?
This issue looks like a showstopper; I can't suggest using Chapel if it is
3 times slower than Fortran. I feel like this needs to be addressed.

Thanks !!!

Igor

The major problem/concern I have is: why is the Chapel CPU run (drag) at least
3 times slower than the Intel Fortran code, which does the same work?
This issue looks like a showstopper; I can't suggest using Chapel if it is
3 times slower than Fortran. I feel like this needs to be addressed.

This is after disabling the GPU support, is that right?

Correct, after unset CHPL_LOCALE_MODEL.

I cannot replicate this on an Intel Skylake. Would you be able to provide more details about your Chapel installation (run printchplenv --all) and how you are compiling the code? For Fortran, I assumed you just used -O3, and that is what I used in my experiments.

> chpl --fast drag.chpl
> ifx -O3 -o dragf drag.f90

> ./drag --n 20000 --ns 1000
Execution Time = 0.443088
> ./dragf
 Wall clock time:   2.90558290481567      seconds
> ifx --version
ifx (IFORT) 2023.0.0 20221201
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

> chpl --version
chpl version 2.5.0 pre-release (f043307e2e)
  built with LLVM version 19.1.3
  available LLVM targets: amdgcn, r600, nvptx64, nvptx, aarch64_32, aarch64_be, aarch64, arm64_32, arm64, x86-64, x86
Copyright 2020-2025 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)
> printchplenv --all --anon
CHPL_HOST_PLATFORM: linux64 *
CHPL_HOST_COMPILER: clang
  CHPL_HOST_CC: /.../llvm-19.1.0-5lt3mpbjgq76ogn7xtnshvqvpvqko2zd/bin/clang
  CHPL_HOST_CXX: /.../llvm-19.1.0-5lt3mpbjgq76ogn7xtnshvqvpvqko2zd/bin/clang++
CHPL_HOST_ARCH: x86_64
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
  CHPL_TARGET_CC: /.../llvm-19.1.0-5lt3mpbjgq76ogn7xtnshvqvpvqko2zd/bin/clang
  CHPL_TARGET_CXX: /.../llvm-19.1.0-5lt3mpbjgq76ogn7xtnshvqvpvqko2zd/bin/clang++
  CHPL_TARGET_LD: /.../llvm-19.1.0-5lt3mpbjgq76ogn7xtnshvqvpvqko2zd/bin/clang++
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none *
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_HOST_MEM: jemalloc
  CHPL_HOST_JEMALLOC: bundled
CHPL_TARGET_MEM: jemalloc
  CHPL_TARGET_JEMALLOC: bundled
CHPL_ATOMICS: cstdlib
CHPL_GMP: none *
CHPL_HWLOC: bundled
  CHPL_HWLOC_PCI: disable
CHPL_RE2: bundled
CHPL_LLVM: system
  CHPL_LLVM_SUPPORT: system
  CHPL_LLVM_CONFIG: /.../llvm-19.1.0-5lt3mpbjgq76ogn7xtnshvqvpvqko2zd/bin/llvm-config *
  CHPL_LLVM_VERSION: 19
CHPL_AUX_FILESYS: none
CHPL_LIB_PIC: none
CHPL_SANITIZE: none
CHPL_SANITIZE_EXE: none

Does the timing change if you add the following line at the end of the Fortran code?

...
print *, "RFU(1), RFU(nn) = ", RFU(1), RFU(nn) !! <-- print something

END PROGRAM DRAG 

On my Linux PC, the timing with the original code (drag-orig.f90) is:

$ gfortran-12 drag-orig.f90
$ time ./a.out
 Wall clock time:   5.8938975334167480      seconds
real	0m5.898s
user	0m5.892s
sys	0m0.005s

$ gfortran-12 -O3 drag-orig.f90
$ time ./a.out
 Wall clock time:   1.1181000445503742E-005 seconds

real	0m0.004s
user	0m0.002s
sys	0m0.001s

So at the -O3 level, gfortran seems to have optimized away all the calculations. If I add the print line at the end of the program (to prevent such optimization), I get:

$ gfortran-12 drag-mod.f90 
$ time ./a.out
 Wall clock time:   5.8653631210327148      seconds
 RFU(1), RFU(nn) =  -0.20124437462896888      -0.20116448934421405     

real	0m5.870s
user	0m5.866s
sys	0m0.003s

$ gfortran-12 -O3 drag-mod.f90 
$ time ./a.out
 Wall clock time:   2.6709594726562500      seconds
 RFU(1), RFU(nn) =  -0.20151005672274283      -0.20134404580057619     

real	0m2.675s
user	0m2.672s
sys	0m0.002s

I do not have Intel Fortran (ifx) or Chapel installed on this PC, so I cannot compare the timing directly, but I guess ifx may have optimized away some (or most?) of the calculations while still evaluating the intrinsic subroutine random_number() many times (e.g., because the compiler cannot determine whether it has side effects).

FWIW, the random_number intrinsic in Intel Fortran is rather slow (if I remember correctly!) because the compiler is still using an algorithm developed in the 1990s or earlier. Indeed, Intel seems uninterested in making random_number() faster (possibly because MKL has a faster vectorized version). On the other hand, random_number() in gfortran uses xoshiro256** (according to the man page) and so is rather fast.

Please see below:

igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ chpl --version
warning: The prototype GPU support implies --no-checks. This may impact
debuggability. To suppress this warning, compile with --no-checks explicitly
chpl version 2.4.0
built with LLVM version 18.1.3
available LLVM targets: xtensa, m68k, xcore, x86-64, x86, wasm64, wasm32,
ve, systemz, sparcel, sparcv9, sparc, riscv64, riscv32, ppc64le, ppc64,
ppc32le, ppc32, nvptx64, nvptx, msp430, mips64el, mips64, mipsel, mips,
loongarch64, loongarch32, lanai, hexagon, bpfeb, bpfel, bpf, avr, thumbeb,
thumb, armeb, arm, amdgcn, r600, aarch64_32, aarch64_be, aarch64, arm64_32,
arm64
Copyright 2020-2025 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)

igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ printchplenv --all --anon
CHPL_HOST_PLATFORM: linux64 *
CHPL_HOST_COMPILER: gnu
CHPL_HOST_CC: gcc
CHPL_HOST_CXX: g++
CHPL_HOST_ARCH: x86_64
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_CC: /usr/lib/llvm-18/bin/clang
--gcc-install-dir=/usr/local/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0
CHPL_TARGET_CXX: /usr/lib/llvm-18/bin/clang++
--gcc-install-dir=/usr/local/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0
CHPL_TARGET_LD: /usr/lib/llvm-18/bin/clang++
--gcc-install-dir=/usr/local/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu *
CHPL_GPU: nvidia
CHPL_GPU_SDK_VERSION: 12.8
CHPL_GPU_MEM_STRATEGY: array_on_device
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_HOST_MEM: jemalloc
CHPL_HOST_JEMALLOC: bundled
CHPL_TARGET_MEM: jemalloc
CHPL_TARGET_JEMALLOC: bundled
CHPL_ATOMICS: cstdlib
CHPL_GMP: none
CHPL_HWLOC: bundled
CHPL_HWLOC_PCI: enable
CHPL_RE2: bundled
CHPL_LLVM: system *
CHPL_LLVM_SUPPORT: system
CHPL_LLVM_CONFIG: /usr/lib/llvm-18/bin/llvm-config *
CHPL_LLVM_VERSION: 18
CHPL_AUX_FILESYS: none
CHPL_LIB_PIC: none
CHPL_SANITIZE: none
CHPL_SANITIZE_EXE: none

igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ ./drag --n=20000 --ns=1000
Execution Time = 2.80186

PS C:\Projects\CHAPEL> ./DRAGF.exe
Wall clock time: 0.406000000000000 seconds

After print addition:
PS C:\Projects\CHAPEL> ./DRAGF.exe
RFU(1), RFU(nn): -0.158249222751921 -0.185814288562971
Wall clock time: 0.396000000000000 seconds
PS C:\Projects\CHAPEL>

Here is the build log:

Compiling with Intel(R) Visual Fortran Compiler 19.0.5.281 [IA-32]...
ifort /nologo /O2 /Qparallel /integer-size:64 /real-size:64
/module:"Release\" /object:"Release\" /Fd"Release\vc140.pdb"
/libs:dll /threads /c /Qlocation,link,"C:\Program Files
(x86)\Microsoft Visual Studio 14.0\VC\bin" /Qm32
"C:\Projects\3KEYRELAP\Release-August-2024\Development\Tasks\DRAGF\drag.f90"
Linking...
Link /OUT:"Release\DRAGF.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST
/MANIFESTFILE:"Release\DRAGF.exe.intermediate.manifest"
/MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE
/IMPLIB:"C:\Projects\3KEYRELAP\Release-August-2024\Development\Tasks\DRAGF\Release\DRAGF.lib"
-qm32 "Release\drag.obj"
Embedding manifest...
mt.exe /nologo /outputresource:"C:\Projects\3KEYRELAP\Release-August-2024\Development\Tasks\DRAGF\Release\DRAGF.exe;#1"
/manifest "Release\DRAGF.exe.intermediate.manifest"

DRAGF - 0 error(s), 0 warning(s)

Let me attach the Fortran source file, just to make sure we have the same one.

Thanks,

Igor

(attachments)

drag-f90.txt (2.98 KB)


Engin,
I am sorry, I made a mistake somehow; I just noticed that the GPU flag was still
on, so I unset it, rebuilt, and ran again. It is much better now, see below:

igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ chpl --fast drag.chpl
igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ ./drag --n=20000 --ns=1000
Execution Time = 0.811797
igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ ./drag --n=20000 --ns=1000
Execution Time = 0.819851
igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$

Adding 'forall' changes the timing only a little bit.

So, much better now, but Chapel is still about 2 times slower.

With
igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ ./dragf --n=50000 --ns=2000
Execution Time = 3.59767
igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$

Intel Fortran gives:

PS C:\Projects\CHAPEL> ./DRAGF.exe
RFU(1), RFU(nn): -0.183924608702411 -0.199632213602153
Wall clock time: 2.12000000000000 seconds
PS C:\Projects\CHAPEL>

So, it is 1.7 times slower.

chplenv is below:

igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ printchplenv --all --anon
CHPL_HOST_PLATFORM: linux64 *
CHPL_HOST_COMPILER: gnu
CHPL_HOST_CC: gcc
CHPL_HOST_CXX: g++
CHPL_HOST_ARCH: x86_64
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_CC: /usr/lib/llvm-18/bin/clang
--gcc-install-dir=/usr/local/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0
CHPL_TARGET_CXX: /usr/lib/llvm-18/bin/clang++
--gcc-install-dir=/usr/local/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0
CHPL_TARGET_LD: /usr/lib/llvm-18/bin/clang++
--gcc-install-dir=/usr/local/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: flat
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_HOST_MEM: jemalloc
CHPL_HOST_JEMALLOC: bundled
CHPL_TARGET_MEM: jemalloc
CHPL_TARGET_JEMALLOC: bundled
CHPL_ATOMICS: cstdlib
CHPL_GMP: none
CHPL_HWLOC: bundled
CHPL_HWLOC_PCI: disable
CHPL_RE2: bundled
CHPL_LLVM: system *
CHPL_LLVM_SUPPORT: system
CHPL_LLVM_CONFIG: /usr/lib/llvm-18/bin/llvm-config *
CHPL_LLVM_VERSION: 18
CHPL_AUX_FILESYS: none
CHPL_LIB_PIC: none
CHPL_SANITIZE: none
CHPL_SANITIZE_EXE: none
igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$

Thanks,
Igor

Trying to draw a sort of bottom line to this discussion.

After the suggested setup change to turn Chapel's GPU support off, I have the following:

On an Intel 2.6 GHz machine with 6 cores and 12 logical processors, the Chapel execution time (drag.chpl, attached above) is 1.7-2.0 times larger than that of the executable built with the Intel Fortran compiler, although the Fortran program does a small amount of extra work applying random-number multipliers.

Help resolving the issue would be greatly appreciated.

Reading through this post, the latest timings you posted for Chapel (running in WSL) and Fortran (running on native Windows) are:

igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ ./drag --n=20000 --ns=1000
Execution Time = 0.811797
igor@AIMobile10:~/Chapel/chapel-2.4.0/examples$ ./drag --n=20000 --ns=1000
Execution Time = 0.819851

VS

PS C:\Projects\CHAPEL> ./DRAGF.exe
RFU(1), RFU(nn): -0.183924608702411 -0.199632213602153
Wall clock time: 2.12000000000000 seconds

I am assuming this is the same Fortran file you listed above, which has nn=20000 and ns=1000, so the same as the Chapel version. This shows Chapel being faster?

On my machine, the Chapel version consistently takes 0.6 seconds. That is compiled with only --fast.

The Fortran version varies, but the fastest was compiled with ifort -O3 drag.f90 -o dragf, which took 1.2 seconds. I also tried -parallel, but that made the code slower.

I also tried a bigger problem size, N=100000 and NS=5000. The Chapel version took 13 seconds with CHPL_RT_NUM_THREADS_PER_LOCALE=12. The Fortran version took 25 seconds compiled with -parallel and OMP_NUM_THREADS=12 (the non-parallel version took even longer).

I am running on a 72 logical core machine, so I tried scaling it up to more cores too. With 32 tasks/threads, Chapel takes 6.7s and Fortran takes 10.6s. With 72 tasks/threads, Chapel takes 6s and Fortran takes 6.5s.

I cannot replicate a case where Fortran is faster than Chapel; at best they are mostly equivalent. A few key differences between our systems:

  • I have LLVM 19, you have LLVM 18. This shouldn't matter much.
  • I have ifort/ifx 2021 on Linux, you have Visual Fortran 19 on Windows.
  • I have a 72-logical-core Skylake, you have a 12-logical-core Coffee Lake.

At this point, it may just come down to machine differences. Note that you are running Chapel through WSL but Fortran natively on Windows. There may be overhead associated with running in WSL; can you run both in WSL?

Lastly, if you are able to isolate the key kernels in each binary and send them to me, I can take a look at the assembly and see if anything sticks out as obviously bad that could explain the difference.

-Jade


The executable built with the Intel Fortran compiler is faster for both runs (50000, 2000 and 20000, 1000), in the range of 1.7-2.0 times faster. Yes, I am running Fortran under Windows. I will attach both executables later today. Thanks!!!

To avoid confusion, here are the numbers again (Chapel vs Fortran run time):

20000, 1000 run --- 0.8 vs 0.4 secs
50000, 2000 run --- 3.6 vs 2.1 secs

Thanks!

I won't be able to run or view the raw executables. Can you disassemble them and share the relevant assembly?

Also, if possible, it would be interesting to run the Fortran version in WSL instead of native Windows.

-Jade

Jade,
I don't have the Intel Fortran compiler for WSL or Linux. How do I produce a Chapel disassembly?

Thanks,
Igor

I've created a Compiler Explorer page for the above Fortran code, from which some rough information about timing may be obtained. (And with that page I also confirmed that ifort/ifx does not optimize the calculation away.)

I wonder whether the native Windows version performs the iii loop with threading (since the /Qparallel option appears to be used)?

!! fortran code

      do oo = 1, ns
          !! call random_number(uu)
          uu = 0.5 / dble(oo)   !!<--- I've changed this to be deterministic (for comparison)
          xx = (bb-aa)*uu + aa

      do  iii = 1, nn  !! <---performed in parallel by ifort(Win,native)?
       
       RHO(iii) = VD(iii) * RHOG(iii) + (1.0-VD(iii)) * RHOF(iii) + 10.0*xx
       RE(iii)  = RHO(iii) * VEL(iii) * DH(iii) / VS(iii)
       XM(iii) = VEL(iii) * VEL(iii) * (1.0 / (1.0 +(1.0/(VD(iii)*xx) - 1.0)))
! 
       RFB(iii) = XM(iii)**0.97 * (FA(iii) + FB(iii)*RE(iii)**(-FC(iii)*0.97))
       RFB(iii) = RFB(iii) + XM(iii)**0.98 * (FA(iii) + FB(iii)*RE(iii)**(-FC(iii)*0.98))
       RFB(iii) = RFB(iii) + XM(iii)**0.99 * (FA(iii) + FB(iii)*RE(iii)**(-FC(iii)*0.99))
!
       RFU(iii)  = 1.0 /(-2.0 + log(RHK(iii) + 2.5/(RE(iii)**1.2)       &
                * (1.1 - 2.0 * log (RHK(iii) + 21.25/RE(iii)**0.90))))
!
       RFU(iii)  = RFU(iii) + 1.0 /(-2.0 + log(RHK(iii) +               &
         2.6/(RE(iii)**1.3) * (1.2 - 2.1 * log (RHK(iii) +              &       
         21.35/RE(iii)**0.92))))
!
       RFU(iii)  = RFU(iii) + RFB(iii)/(1.0 /(-2.0 + log(RHK(iii) +     &
         2.7/(RE(iii)**1.4) * (1.3 - 2.2 * log (RHK(iii) +              &       
         21.45/RE(iii)**0.94)))))
!     
       enddo
       enddo ! oo

In the Chapel version (WSL), I also wonder whether the individual whole-array assignment statements are each performed in parallel separately?

// chapel code
     for ii in 1..ns do {

       RHO = VD * RHOG + (1.0-VD) * RHOF;   //<-- each line performed in parallel?
       RE  = RHO * VEL * DH / VS;
       XM = VEL * VEL * (1.0 / (1.0 +(1.0/VD - 1.0)));
 
       RFB  = XM**0.97 * (FA + FB*RE**(-FC*0.97));
       RFB += XM**0.98 * (FA + FB*RE**(-FC*0.98));
       RFB += XM**0.99 * (FA + FB*RE**(-FC*0.99));
    
       RFU  = 1.0 /(-2.0 + log(RHK + 2.5/(RE**1.2) * (1.1 - 2.0 * log(RHK + 21.25/RE**0.90))));
       RFU += (1.0-VD) /(-2.1 + log(RHK + 2.6/(RE**1.3) * (1.2 - 2.1 * log(RHK + 21.35/RE**0.92))));
       RFU += RFB /(-2.2 + log(RHK + 2.7/(RE**1.4) * (1.3 - 2.2 * log(RHK + 21.45/RE**0.94))));

      } // for ns

If so, does rewriting the code in a way similar to the Fortran (using an explicit loop index) and using forall over that index (like iii) give a different timing...?

(It may also be better to print a "checksum" (as in the Compiler Explorer page) in the Chapel code to ensure that the computation is identical.)
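
For example, a minimal checksum sketch in Chapel (with a hypothetical array standing in for RFU):

      config const n = 20_000;
      var RFU: [1..n] real = 1.0;

      // print a checksum plus the endpoints, mirroring the Fortran print above
      writeln("checksum(RFU)  = ", + reduce RFU);
      writeln("RFU(1), RFU(n) = ", RFU[1], " ", RFU[n]);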


@tbzy that is likely. The sub-expressions should be coalesced properly, but each separate line is going to be a separate thread invocation.

I rewrote the core loop with a single forall and saw a 2x speedup in the Chapel code:

        forall iii in 1..n {
          RHO(iii) = VD(iii) * RHOG(iii) + (1.0-VD(iii)) * RHOF(iii);
          RE(iii)  = RHO(iii) * VEL(iii) * DH(iii) / VS(iii);
          XM(iii) = VEL(iii) * VEL(iii) * (1.0 / (1.0 +(1.0/VD(iii) - 1.0)));
    
          RFB(iii)  = XM(iii)**0.97 * (FA(iii) + FB(iii)*RE(iii)**(-FC(iii)*0.97));
          RFB(iii) += XM(iii)**0.98 * (FA(iii) + FB(iii)*RE(iii)**(-FC(iii)*0.98));
          RFB(iii) += XM(iii)**0.99 * (FA(iii) + FB(iii)*RE(iii)**(-FC(iii)*0.99));
        
          RFU(iii)  = 1.0 /(-2.0 + log(RHK(iii) + 2.5/(RE(iii)**1.2) * (1.1 - 2.0 * log(RHK(iii) + 21.25/RE(iii)**0.90))));
          RFU(iii) += (1.0-VD(iii)) /(-2.1 + log(RHK(iii) + 2.6/(RE(iii)**1.3) * (1.2 - 2.1 * log(RHK(iii) + 21.35/RE(iii)**0.92))));
          RFU(iii) += RFB(iii) /(-2.2 + log(RHK(iii) + 2.7/(RE(iii)**1.4) * (1.3 - 2.2 * log(RHK(iii) + 21.45/RE(iii)**0.94))));
        }

@iarshavsky can you try this as well?

And to answer your question about disassembly: if you compile with --fast --llvm-print-ir dragt --llvm-print-ir-stage asm, it will compile and dump the assembly for the dragt function. If you do that, it would also be helpful to provide the output of --fast --llvm-print-ir dragt --llvm-print-ir-stage full, which will give the LLVM IR (more readable than x86 assembly). Of course, you can also use objdump on the executable, but the Chapel compiler flags are probably easier.

You should be able to install the Intel Fortran compiler for Linux on WSL; I believe Intel provides the Linux binaries for free.

-Jade
