Chapel and C/C++ performance difference

We developed a new connected components graph algorithm, Contour. Compared with the existing FastSV algorithm implemented in Chapel, Contour achieves roughly a 2-8x speedup. We also implemented Contour in the C++ GBBS graph package, where the speedup is about 2x. However, the C++ execution times are much shorter than the Chapel ones, and we hope to find out why Chapel has such a large overhead.

The Chapel comparison in Arkouda is run as follows (a consolidated command sketch follows the steps).

1 Build the Chapel environment, version >= 1.29.0.
2 Download the latest arkouda package.
3 Download the arkouda-njit package (git@github.com:Bears-R-Us/arkouda-njit.git) under the arkouda directory and rename arkouda-njit to arkouda_njit.
4 Compile arkouda-njit: enter the arkouda/arkouda_njit directory and run
python module_configuration.py --ak_loc=/complete/path/to/arkouda/ --pkg_path=/complete/path/to/arkouda_njit/arachne_development/ | bash
5 Under the arkouda directory, start the arkouda server:
./arkouda -nl 1 --ServerConfig.ServerPort=5050 2>&1 | tee -a output.text
6 Run the comparison program under the directory arkouda_njit/arachne_development/test/client/benchmarks:
./runclient.sh batch_cc.py machinename 5050
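For reference, steps 4-6 can be collected into one shell session. This is only a sketch: the /path/to/... locations and machinename are placeholders for your own setup, and the server is backgrounded here so the client can be started from the same shell.

# step 4: build the arkouda-njit modules against your arkouda checkout
cd /path/to/arkouda/arkouda_njit
python module_configuration.py --ak_loc=/path/to/arkouda/ --pkg_path=/path/to/arkouda_njit/arachne_development/ | bash

# step 5: start the arkouda server on a single locale, logging to output.text
cd /path/to/arkouda
./arkouda -nl 1 --ServerConfig.ServerPort=5050 2>&1 | tee -a output.text &

# step 6: run the connected-components benchmark client against port 5050
cd arkouda_njit/arachne_development/test/client/benchmarks
./runclient.sh batch_cc.py machinename 5050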

In batch_cc.py, we can keep only the delaunay_n20.mtx graph and remove all other graphs. We need to download the graph file and change the path in the script to its location.
Here is the output of FastSV (fast sv cc) and our Contour (fs dist cc) in output.text:
2023-06-08:09:39:09 [CCMsg] segCCMsg Line 2404 DEBUG [Chapel] Time elapsed for fast sv cc: 0.144295
2023-06-08:09:39:09 [CCMsg] segCCMsg Line 2418 DEBUG [Chapel] Time elapsed for fs dist cc: 0.017734

The C++ comparison based on the gbbs package is run as follows (a consolidated command sketch follows the steps).

1 Download gbbs:
git clone git@github.com:zhihuidu/gbbs.git
2 Build gbbs:
cd gbbs
git submodule update --init
bazel build //...
3 Generate the converter tool:
cd gbbs/utils
make
./snap_converter -i edgelistfile -o adjfile
Given an edge list file as input, this outputs the adjacency list file for gbbs.
4 Run the test:
numactl -i all ./bazel-bin/benchmarks/Connectivity/ConnectIt/mains/shiloach_vishkin -rounds 3 -s -m -r 10 ./inputs/delaunay_n20.adj
This uses delaunay_n20.adj as the input graph and runs the shiloach_vishkin algorithm.
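For reference, the whole gbbs sequence can be run as the following sketch. The edge list and adj file names (and the inputs/ directory) are placeholders; the submodule step is required, otherwise the parlay headers will be missing at build time.

git clone git@github.com:zhihuidu/gbbs.git
cd gbbs
git submodule update --init      # fetch parlay and the other bundled dependencies
bazel build //...                # build all benchmarks, including the ConnectIt mains

# build the edge-list -> adjacency-list converter and convert the graph
cd utils
make
./snap_converter -i delaunay_n20.edgelist -o ../inputs/delaunay_n20.adj
cd ..

# run Shiloach-Vishkin connectivity on the converted graph
numactl -i all ./bazel-bin/benchmarks/Connectivity/ConnectIt/mains/shiloach_vishkin -rounds 3 -s -m -r 10 ./inputs/delaunay_n20.adj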
The reported times for shiloach_vishkin (median, min, and max over 10 rounds) are:
{
"test_type": "static_connectivity_result",
"test_name" : "shiloach_vishkin; no_sampling",
"graph" : "./inputs/delaunay_n20.adj",
"rounds" : 10,
"medt" : 0.02108,
"mint" : 0.020909,
"maxt" : 0.021629
}

The following command runs our Contour algorithm:
numactl -i all ./bazel-bin/benchmarks/Connectivity/ConnectIt/mains/contour -rounds 3 -s -m -r 10 ./inputs/delaunay_n20.adj

The no_sampling results are as follows.
{
"test_type": "static_connectivity_result",
"test_name" : "contour; no_sampling",
"graph" : "./inputs/delaunay_n20.adj",
"rounds" : 10,
"medt" : 0.013881,
"mint" : 0.013434,
"maxt" : 0.014609
}

The system information is as follows.

uname -a
Linux alex 5.19.0-43-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon May 22 13:39:36 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The last processor entry from cat /proc/cpuinfo is as follows.

processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 151
model name : 12th Gen Intel(R) Core(TM) i9-12900KS
stepping : 2
microcode : 0x2c
cpu MHz : 3400.000
cache size : 30720 KB
physical id : 0
siblings : 24
core id : 39
cpu cores : 16
apicid : 78
initial apicid : 78
fpu : yes
fpu_exception : yes
cpuid level : 32
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_mode_based_exec tsc_scaling usr_wait_pause
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb
bogomips : 6835.20
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
The output of cat /proc/meminfo is as follows.
MemTotal: 32606972 kB
MemFree: 29801908 kB
MemAvailable: 31092068 kB
Buffers: 101636 kB
Cached: 1503568 kB
SwapCached: 0 kB
Active: 812976 kB
Inactive: 1249784 kB
Active(anon): 2596 kB
Inactive(anon): 454284 kB
Active(file): 810380 kB
Inactive(file): 795500 kB
Unevictable: 32 kB
Mlocked: 32 kB
SwapTotal: 2097148 kB
SwapFree: 2097148 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 4 kB
Writeback: 0 kB
AnonPages: 457856 kB
Mapped: 326920 kB
Shmem: 11356 kB
KReclaimable: 105424 kB
Slab: 405600 kB
SReclaimable: 105424 kB
SUnreclaim: 300176 kB
KernelStack: 13104 kB
PageTables: 16904 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18400632 kB
Committed_AS: 3935208 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 181992 kB
VmallocChunk: 0 kB
Percpu: 26208 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 366260 kB
DirectMap2M: 6709248 kB
DirectMap1G: 26214400 kB

Hi Dr. Du, I have started looking into this and am confused about the data files in batch_cc.py. The file uses paths hardcoded to your home machine, and I believe I need to download those files from somewhere in order to run the benchmark.

Is there somewhere that I can obtain those files?

One other question: could you provide the output from running printchplenv in your environment?

I am glad to know that you are looking into it. Sorry for the confusion in batch_cc.py. You are right: you need to change the path based on your directory. You can download the graph file here:
https://suitesparse-collection-website.herokuapp.com/MM/DIMACS10/delaunay_n20.tar.gz
This is the delaunay_n20 graph file. Of course, you can also use other mtx graph files. However, you then need to: (1) delete all the comments in the original mtx file; (2) remove the first line
1048576 1048576 3145686
which gives the total number of vertices (1048576) and the total number of edges (3145686); and (3) convert the edge list file to an adj file by executing
./snap_converter -i edgelistfile -o adjfile
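A minimal preprocessing sketch for these three steps, assuming the MatrixMarket comment lines start with '%' and the tarball unpacks into a delaunay_n20/ directory (the other file names are placeholders):

# download and unpack the graph
wget https://suitesparse-collection-website.herokuapp.com/MM/DIMACS10/delaunay_n20.tar.gz
tar xzf delaunay_n20.tar.gz

# (1) drop the '%' comment lines and (2) the size line "1048576 1048576 3145686",
# leaving only the edge list
grep -v '^%' delaunay_n20/delaunay_n20.mtx | tail -n +2 > delaunay_n20.edgelist

# (3) convert the edge list to the gbbs adjacency format
./snap_converter -i delaunay_n20.edgelist -o delaunay_n20.adj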

It is quite possible that I missed something in the document, which may prevent you from running it correctly. So please just let me know if you have any problems with this test.

Please see the results of printchplenv

(arkouda-dev) duzh@alex:~/arkouda$ printchplenv
machine info: Linux alex 5.19.0-43-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon May 22 13:39:36 UTC 2 x86_64
CHPL_HOME: /home/duzh/chapel-1.29.0 *
script location: /home/duzh/chapel-1.29.0/util/chplenv
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
CHPL_COMM_SUBSTRATE: udp *
CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: qthreads *
CHPL_LAUNCHER: amudprun *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc *
CHPL_ATOMICS: cstdlib
CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled *
CHPL_HWLOC: bundled *
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none

Thanks for the quick response! It looks like you have CHPL_COMM=gasnet set, which is going to cause your single locale (i.e., running with -nl 1) programs to run much slower. Could you rebuild your chpl compiler with CHPL_COMM=none set, recompile Arkouda, and then regather these numbers? I would expect a significant improvement in performance with just that change.

That should look something like:

  1. export CHPL_COMM=none
  2. cd $CHPL_HOME
  3. make -j 8
  4. recompile Arkouda/Arkouda NJIT

Based on the system information above, it doesn't look like you are running on a distributed system, so I would not recommend setting CHPL_COMM=gasnet except for very specific situations (which does not include performance runs). In general, CHPL_COMM should only be set to gasnet when trying to run multi-locale Chapel programs, so you may want to modify your .bashrc or however that variable is being set in your environment to avoid having that set automatically.

One other environment-related thing to check would be to ensure that ARKOUDA_QUICK_COMPILE or ARKOUDA_DEVELOPER are not set, as these will disable optimizations if set when building Arkouda, which would cause much slower performance.
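To make the rebuild steps concrete, here is a sketch of the sequence under those settings; the directory locations are placeholders, and the Arkouda/arkouda-njit rebuild commands are assumed to be the same ones used in the original setup:

# make sure the quick-compile/developer shortcuts are not set when building Arkouda
unset ARKOUDA_QUICK_COMPILE ARKOUDA_DEVELOPER

# rebuild Chapel for single-locale (shared-memory) runs
export CHPL_COMM=none
cd $CHPL_HOME
make -j 8

# rebuild the Arkouda server, then the arkouda-njit modules, against the new runtime
cd /path/to/arkouda
make
cd arkouda_njit
python module_configuration.py --ak_loc=/path/to/arkouda/ --pkg_path=/path/to/arkouda_njit/arachne_development/ | bash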


Thanks and I will recompile it and check the performance.

My computer is overclocked, so I disabled turbo boost and retested. Please see the following results.
For the shared-memory configuration (CHPL_COMM=none):

2023-06-17:11:43:58 [CCMsg] segCCMsg Line 2390 DEBUG [Chapel] Time elapsed for fast sv dist cc: 0.059391
2023-06-17:11:43:58 [CCMsg] segCCMsg Line 2418 DEBUG [Chapel] Time elapsed for fs dist cc: 0.00659

For the multilocale configuration (CHPL_COMM=gasnet):
2023-06-17:09:52:10 [CCMsg] segCCMsg Line 2390 DEBUG [Chapel] Time elapsed for fast sv dist cc: 0.201416
2023-06-17:09:52:11 [CCMsg] segCCMsg Line 2418 DEBUG [Chapel] Time elapsed for fs dist cc: 0.025898

The speedup from switching to CHPL_COMM=none is about 0.025898/0.00659 = 3.93 for Contour (fs dist cc) and 0.201416/0.059391 = 3.39 for FastSV.

Yes, just as you predicted, the performance improved significantly. Thanks for letting me know about this.

Great, glad to hear that helped out. So, what is the Chapel vs C++ comparison at this point? Is Chapel still behind C++? I have still not been able to run this myself; it seems that the batch_cc.py file requires many files, not just the one you provided download instructions for, and I am also unsure what the snap_converter program is that you are executing; I wasn't able to easily figure that out with a Google search.

Are you running your test with all of the lines deleted except for the [3145686,1048576,2,0,HomeDir+"Adata/Delaunay/delaunay_n20/delaunay_n20.mtx.pr"] line, or are you running with all of the files? Can you provide a link for the snap_converter program?

The gbbs package performance in C++ for the same input graph is as follows.
{
"test_type": "static_connectivity_result",
"test_name" : "contour; no_sampling",
"graph" : "./inputs/graphs/edge_streams_coo/delaunay_n20.adj",
"rounds" : 10,
"medt" : 0.017801,
"mint" : 0.01715,
"maxt" : 0.019555
}
The reported time is 0.017801 s. Although Chapel is a little slower than gbbs here, it is very close to the C++ performance.
Yes, batch_cc.py will test many input graphs. You can keep just the ones you want to test. Currently, I only test delaunay_n20 to check the performance results.

Here is the converter program: gbbs/utils/snap_converter.cc in the zhihuidu/gbbs repository on GitHub.
You just need to run
make
under the utils directory to generate the program.

Please let me know if you have any questions about the test. Thanks!

Though you seemed reasonably satisfied with where this stands after changing to CHPL_COMM=none, I decided to put a bit more time into it to see whether I could reproduce your results. I'm about to time out on this effort, but wanted to capture what I was and was not able to accomplish:

  1. I was not able to build the C++ code, seeing errors related to pathing (I believe paths relative to your machine, such as bridge.h:9:10: fatal error: parlay/delayed_sequence.h: No such file or directory)
  2. I was not able to preprocess the data using snap_converter due to similar build issues
  3. I was able to run the Chapel code using only the delaunay_n20.mtx file, though I'm unsure whether the output is correct, given the lack of preprocessing
  4. I was unable to determine if I was running in the same way because I had to modify the scripts in order to get things to run

So, I have some Chapel numbers, but I am unsure if they were collected correctly and am also unsure how to interpret the data and am unable to collect corresponding C++ data to compare against.

1 About the C++ package compile error: please check whether you executed this step
git submodule update --init
After you git clone the gbbs package, you need to enter the gbbs directory and run the above git command to download the parlay package. I could not compile it either at first, and then I found that I had missed this step.

2 I guess the snap_converter build failure has the same cause.

3 If you use the mtx file directly, please replace the call to
njit.graph_file_read
with
njit.graph_file_read_mtx
I preprocess the mtx files so that they contain only the edge list; you need to use graph_file_read_mtx if you use the original mtx input file directly.

4 I think it's ok to change the script.

Hi Ben, please see the latest test results below. After following your suggestion, I now get much better results.
I cannot upload the file, so I have pasted the results here for your reference.

Graph Name | Graph ID | Number of Edges | Number of Vertices | Chapel time (s) | gbbs C++ time (s) | Slowdown (Chapel/gbbs)
ca-GrQc 0 28980 5242 0.000264 0.000324 0.814814815
ca-HepTh 1 51971 9877 0.000323 0.000397 0.813602015
facebook_combined 2 88234 4039 0.000243 5.60E-05 4.339285714
wiki 3 103689 8277 0.000339 0.000175 1.937142857
as-caida20071105 4 106762 26475 0.000536 0.00015 3.573333333
ca-CondMat 5 186936 23133 0.000583 0.000658 0.886018237
ca-HepPh 6 237010 12008 0.000567 0.0006145 0.922701383
email-Enron 7 367662 36692 0.000651 0.0006645 0.979683973
ca-AstroPh 8 396160 18772 0.000534 0.0007255 0.736044108
loc-brightkite_edges 9 428156 58228 0.000975 0.00056 1.741071429
soc-Epinions1 10 508837 75879 0.001538 0.000789 1.949302915
com-dblp 11 1049866 317080 0.008095 0.0014275 5.670753065
com-youtube 12 2987624 1134890 0.025637 0.0035885 7.144210673
amazon0601 13 2443408 403394 0.008375 0.003014 2.778699403
soc-LiveJournal1 14 68993773 4847571 0.20743 0.040581 5.111505384
higgs-social_network 15 14855842 456626 0.024719 0.004915 5.029298067
com-orkut 16 117185083 3072441 0.262083 0.027512 9.526134051
road_usa 17 28854312 23947347 0.193439 0.076205 2.538402992
kmer_A2a 18 180292586 170728175 2.09394 0.54448 3.845761093
kmer_V1r 19 232705452 214005017 2.7466 0.61255 4.483878867
uk-2002 20 298113762 18520486 0.369062 0.56479 0.653449955
delaunay_n10 21 3056 1024 0.000109 8.90E-05 1.224719101
delaunay_n11 22 6127 2048 0.000109 0.000109 1
delaunay_n12 23 12264 4096 0.000172 0.000123 1.398373984
delaunay_n13 24 24547 8192 0.000202 0.000137 1.474452555
delaunay_n14 25 49122 16384 0.000287 0.000184 1.559782609
delaunay_n15 26 98274 32768 0.000462 0.0002745 1.683060109
delaunay_n16 27 196575 65536 0.000876 0.000438 2
delaunay_n17 28 393176 131072 0.001228 0.000639 1.921752739
delaunay_n18 29 786396 262144 0.00245 0.000929 2.637244349
delaunay_n19 30 1572823 524288 0.003442 0.0018515 1.859033216
delaunay_n20 31 3145686 1048576 0.00864 0.0034375 2.513454545
delaunay_n21 32 6291408 2097152 0.014191 0.006838 2.075314419
delaunay_n22 33 12582869 4194304 0.033228 0.01365 2.434285714
delaunay_n23 34 25165784 8388608 0.065764 0.027456 2.395250583
delaunay_n24 35 50331601 16777216 0.142013 0.056348 2.52028466
Average slowdown (Chapel/gbbs) = 2.615891748