I am testing Chapel performance with a view to using it for nuclear engineering safety analysis codes (NUC) or real-time training simulation codes.
The motivation is as follows. NUC codes typically use just one processor. Parallelization efforts are limited to coarse-grained parallelization using either MPI or proprietary tools; in such cases the code is split into, typically, about 5 executables that run in parallel and exchange and synchronize data.
My work laptop has 16 cores, so I anticipate that a few years from now commonly affordable desktops will have >64 cores. Effective utilization of those cores has the potential for a 100-fold performance boost, which is promising from a business perspective (competitiveness, profit).
I installed Ubuntu 24.04.2 running under MS Windows 11 Pro, on an Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz 2.59 GHz. The machine has 1 GPU, 6 cores and 12 logical processors. Chapel was built with GPU support.
drag.chpl (1.6 KB)
dragg.chpl (2.5 KB)
In the uploaded tests I tried to use typical data and perform typical calculations (a small piece of them, of course). The number of nodes is n=20000 and the number of steps is ns=1000.
By doing:
chpl --fast drag.chpl
./drag --n=20000 --ns=1000
I get an execution time of about 3 sec.
dragg (run on the GPU) gives an execution time of about 2 sec.
I got the best timing when
export CHPL_RT_NUM_THREADS_PER_LOCALE=12
meaning, when the number of threads is set to the number of logical processors.
Changing the value in the export above changes the run time, and Windows Task Manager clearly shows that all of the machine's processors are in use.
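A minimal sketch of how such a thread-count sweep can be scripted, for anyone who wants to reproduce it. The thread counts 1, 2, 4, 6 and 12 are just values picked for this 6-core / 12-logical-processor machine, and the names BIN, ARGS and run_sweep are only illustrative. It assumes the binary prints its own execution time, as in the runs above.

```shell
#!/bin/sh
# Binary and arguments from the run shown above; change as needed.
BIN="./drag"
ARGS="--n=20000 --ns=1000"

# Run the binary once per thread count and label each run.
run_sweep() {
  for t in 1 2 4 6 12; do
    echo "threads=$t"
    # The variable is set for this single invocation only.
    CHPL_RT_NUM_THREADS_PER_LOCALE="$t" "$BIN" $ARGS
  done
}

if [ -x "$BIN" ]; then
  run_sweep
else
  echo "binary $BIN not found; build it first with: chpl --fast drag.chpl"
fi
```

Setting the variable on the command line rather than with export keeps each run independent of whatever was exported earlier in the shell.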
Since the legacy code I am aiming to parallelize is written in good old Fortran, I wrote a similar test (attached), which was compiled using Intel Fortran.
drag-f90.txt (2.9 KB)
Please note that I added random multipliers to avoid 'too smart' compiler optimization.
The execution time is about 0.4 sec.
./drag --n=50000 --ns=2000
gives Execution time 16.2 sec
./dragg --n=50000 --ns=2000
gives Execution time 5.7 sec
The Fortran execution time for the same case is 2.12 sec.
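To put the n=50000, ns=2000 numbers side by side, the ratios to the Fortran time can be computed directly from the three timings quoted above. This is plain arithmetic on those measurements, nothing more; the function name ratios is only illustrative.

```shell
#!/bin/sh
# Ratio of each Chapel timing to the Fortran timing (n=50000, ns=2000).
ratios() {
  awk 'BEGIN {
    fortran = 2.12   # sec, Intel Fortran
    printf "drag  / Fortran: %.1fx\n", 16.2 / fortran   # CPU run
    printf "dragg / Fortran: %.1fx\n", 5.7 / fortran    # GPU run
  }'
}

ratios
# prints:
# drag  / Fortran: 7.6x
# dragg / Fortran: 2.7x
```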
Any thoughts or clues?
Help would be greatly appreciated!