CHAPEL performance test on WSL ubuntu

@iarshavsky, thanks for providing that disassembly! I'm not certain that is the disassembly of the right kernel; it seems to contain a lot of data and not a lot of code (but I am also not that familiar with the PE format). Regardless, it did contain a few key pieces of information that have allowed me to reproduce the issue and also come up with a few solutions.

Intel compilers by default try to link in SVML, which Chapel does not. This shows up in your assembly. In my own disassembly, I saw some SVML calls, but not a lot of vector code. I was missing -march=native. Adding that to the Fortran compilation, I can reproduce the performance gap, with Fortran now matching the Chapel performance with 12 threads. So the full compilation command was ifort -O3 -parallel -march=native drag.f90 -o dragf.

However, with a little bit of heroics I can also auto-vectorize Chapel code with SVML. It's unfortunately not possible yet with the LLVM backend, but by switching to the C backend I could do this. Here are the commands I used (which you, Igor, can also use on your machine to get similar results):

(cd $CHPL_HOME && export CHPL_TARGET_COMPILER=clang && nice make all -j`nproc` && unset CHPL_TARGET_COMPILER)
chpl --fast drag.chpl -o drag_llvm --target-compiler=llvm # use the LLVM backend
chpl --fast drag.chpl -o drag_c --target-compiler=clang # use the C backend
chpl --fast drag.chpl -o drag_c_svml --target-compiler=clang --no-ieee-float --ccflags -fveclib=SVML -L/path/to/intel/SVML/libraries -lsvml # use the C backend with SVML

Note that the version with SVML requires the use of the Chapel flag --no-ieee-float, which among other things turns on -ffast-math for clang. Without it, Chapel/Clang/LLVM will not auto-optimize to SVML (because you must opt in to a lower floating-point precision).
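
One way to confirm that the SVML build really calls into SVML is to scan its disassembly for SVML symbols. The sketch below assumes that SVML entry points carry the __svml_ prefix (for example __svml_sin4) and that objdump is installed; count_svml_calls is a hypothetical helper name, not part of Chapel or the Intel tools.

```shell
# Count the lines of a disassembly (read from stdin) that mention an SVML symbol.
# Assumption: SVML entry points are named with the "__svml_" prefix.
count_svml_calls() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the exit status clean.
  grep -c '__svml_' || true
}

# Example (assumes objdump is installed and drag_c_svml was built as shown above):
#   objdump -d drag_c_svml | count_svml_calls
```

A count of 0 for drag_c_svml would suggest that the vectorizer did not pick SVML, for example because --no-ieee-float was left off.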

Running the SVML version results in a nice 1.3x speedup.

The last thing is a big difference between the cores on my machine and the cores on your machine. When you run with 12 threads, you are using all of the available logical cores. When I run with 12 threads, I am using 12 physical cores. This is going to make a big difference.
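
To check which situation a given machine is in, the logical and physical core counts can be compared. This is a minimal sketch that assumes a Linux system with lscpu from util-linux; count_cores is a hypothetical helper name.

```shell
# Read `lscpu -p=CPU,CORE,SOCKET` output on stdin and print "<logical> <physical>".
# Each non-comment line is one logical CPU; each distinct (core, socket) pair
# is one physical core.
count_cores() {
  awk -F, '!/^#/ && NF >= 3 {
             logical++
             if (!seen[$2 "," $3]++) physical++
           }
           END { print logical + 0, physical + 0 }'
}

# Example (assumes lscpu is available):
#   lscpu -p=CPU,CORE,SOCKET | count_cores
```

If the first number is larger than the second, hyperthreading is enabled, and any thread count above the physical core count will land on hyperthreads.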

Summary of timings

I compared these versions with N=100_000 and NS=5000:

  • ifort -O3 -parallel -march=native drag.f90 -o dragf

  • chpl --fast drag.chpl -o drag_llvm --target-compiler=llvm

  • chpl --fast drag.chpl -o drag_c_svml --target-compiler=clang --no-ieee-float --ccflags -fveclib=SVML -L/path/to/intel/SVML/libraries -lsvml

  • chpl --fast drag2.chpl -o drag2_llvm --target-compiler=llvm

  • chpl --fast drag2.chpl -o drag2_c_svml --target-compiler=clang --no-ieee-float --ccflags -fveclib=SVML -L/path/to/intel/SVML/libraries -lsvml

  • ./dragf

    • Default: 3.8s
    • OMP_NUM_THREADS=12: 13.5s
    • OMP_NUM_THREADS=36: 5.1s
    • OMP_NUM_THREADS=72: 3.8s
  • ./drag_llvm

    • Default: 6.3s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=12: 13.3s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=36: 6.3s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=72: 6.0s
  • ./drag_c_svml

    • Default: 5.4s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=12: 10.9s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=36: 5.4s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=72: 5.0s
  • ./drag2_llvm

    • Default: 4.6s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=12: 11.8s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=36: 4.6s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=72: 3.8s
  • ./drag2_c_svml

    • Default: 1.9s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=12: 4.2s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=36: 1.9s
    • CHPL_RT_NUM_THREADS_PER_LOCALE=72: 1.4s
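
A sweep over thread counts like the one above could be scripted roughly as follows. This is only a sketch: sweep_threads is a hypothetical helper, and it assumes the program reports its own elapsed time (otherwise the command can be wrapped in a timer such as bash's time).

```shell
# Run a command once per thread count, setting the given environment variable.
# Usage: sweep_threads VAR "count1 count2 ..." command [args...]
sweep_threads() {
  local var="$1" counts="$2"
  shift 2
  local n
  for n in $counts; do
    echo "== $var=$n"
    env "$var=$n" "$@"
  done
}

# Examples (binaries built as shown above):
#   sweep_threads OMP_NUM_THREADS "12 36 72" ./dragf
#   sweep_threads CHPL_RT_NUM_THREADS_PER_LOCALE "12 36 72" ./drag_llvm
```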

Here is my best guess as to the issues causing the performance gap for you:

  • Intel's Fortran thread pool implementation does a much better job at parallelizing with hyperthreads/logical cores. All other factors being the same, when both Intel Fortran and Chapel use only the physical cores the performance is the same, but Intel Fortran has an edge with hyperthreads.
  • By default, Chapel will not vectorize with SVML like ifort will. Turning this on brings Chapel to parity with Intel Fortran.
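  • Using an explicit forall loop (drag2) gives Chapel a substantial speedup with both backends: at the default thread count, about 1.4x with the LLVM backend (6.3s to 4.6s) and about 2.8x with the C backend plus SVML (5.4s to 1.9s).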

I've attached the three source files I used here:
drag.chpl (1.7 KB)
drag.f90.txt (2.6 KB)
drag2.chpl (1.9 KB)

Hope this helps!
-Jade
