Hi Igor,
As Jade said, I have been trying to reproduce your performance results on my WSL machine, but I have not been able to reproduce the speed improvement you saw in your Fortran results.
First, I installed the Intel oneAPI Fortran compiler in both Windows and WSL from the following link: https://www.intel.com/content/www/us/en/developer/tools/oneapi/fortran-compiler-download.html
On the base Windows OS (Windows 10 Pro), I got the following results:
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.1.0 Build 20250317
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.43.34810.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:dragf.exe
-subsystem:console
dragf.obj
C:\Users\Ahmad\git\fortran>dragf.exe
RFU(1), RFU(nn): -0.183924604199142 -0.199632208309082
Wall clock time: 10.2740001678467 seconds
Moving to WSL (Ubuntu 22.04.3 LTS), the results are similar:
[ahmad@HECTOR ~]$ ifx dragf.f90 -o dragf -O3
[ahmad@HECTOR ~]$ ./dragf
RFU(1), RFU(nn): -0.183924604199142 -0.199632208309082
Wall clock time: 10.0947742462158 seconds
[ahmad@HECTOR ~]$ ifx --version
ifx (IFX) 2025.1.0 20250317
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.
Doing the experiment with Chapel in WSL:
[ahmad@HECTOR chapel()]$ chpl --fast drag.chpl
[ahmad@HECTOR chapel()]$ ./drag --n=50000 --ns=2000
Execution Time = 2.47882
[ahmad@HECTOR chapel()]$ chpl --version
chpl version 2.4.0
built with LLVM version 14.0.0
available LLVM targets: m68k, xcore, x86-64, x86, wasm64, wasm32, ve, systemz, sparcel, sparcv9, sparc, riscv64, riscv32, ppc64le, ppc64, ppc32le, ppc32, nvptx64, nvptx, msp430, mips64el, mips64, mipsel, mips, lanai, hexagon, bpfeb, bpfel, bpf, avr, thumbeb, thumb, armeb, arm, amdgcn, r600, aarch64_32, aarch64_be, aarch64, arm64_32, arm64
Copyright 2020-2025 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)
My CPU: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
I have an 8-core/16-thread Intel i7-10700K and found that I could squeeze slightly more performance out of the original Chapel code by setting CHPL_RT_NUM_THREADS_PER_LOCALE=16, consistent with your finding that timings improve when this variable is set to 2x the number of cores on processors that support hyper-threading.
[ahmad@HECTOR chapel()]$ CHPL_RT_NUM_THREADS_PER_LOCALE=16 ./drag --n=50000 --ns=2000
Execution Time = 1.68287
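A sweep over several thread counts could be scripted roughly as follows. This is only a sketch: the function name sweep_threads is hypothetical, and it assumes nothing beyond the CHPL_RT_NUM_THREADS_PER_LOCALE environment variable used above and the standard env utility.

```shell
# Hypothetical helper: run a command once per thread count.
# $1 is a space-separated list of counts; the remaining arguments are the command.
sweep_threads() {
    counts="$1"
    shift
    for t in $counts; do
        # Label each run so the timings can be matched to the setting.
        printf 'threads=%s: ' "$t"
        # env sets the variable for this one invocation only.
        env CHPL_RT_NUM_THREADS_PER_LOCALE="$t" "$@"
    done
}
```

For example, `sweep_threads "1 2 4 8 16" ./drag --n=50000 --ns=2000` would print one labelled "Execution Time" line per setting.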
I'm curious why I cannot replicate your Fortran results on my machine. Since I don't normally compile Fortran, I tried to reconstruct a working command from your pasted build log, but did not see any improvement from using /Qparallel (or the other options I tried) on Windows.
C:\Users\Ahmad\git\fortran>ifx dragf.f90 -O3 /Qparallel -o dragf
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.1.0 Build 20250317
Copyright (C) 1985-2025 Intel Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 14.43.34810.0
Copyright (C) Microsoft Corporation. All rights reserved.
-out:dragf.exe
-subsystem:console
dragf.obj
C:\Users\Ahmad\git\fortran>dragf.exe
RFU(1), RFU(nn): -0.183924604199142 -0.199632208309082
Wall clock time: 10.1850004196167 seconds
From your earlier message:
I installed intel fortran compiler on ubuntu, and ran the same drag.f90
test, it has 11.3 secs execution time, close enough to what I have running
under Windows, even slightly faster.
...
Important conclusion - under Windows and ubuntu sequential runs take the
same time.
Can you provide the minimal set of compiler options that reproduces the speedup you are seeing in the Fortran code between compilations on Windows?