Chapel and C/C++ performance difference

We developed a new connected components graph algorithm, Contour. Compared with the existing FastSV algorithm implemented in Chapel, Contour achieves roughly a 2-8x speedup. We also implemented Contour in the C++ GBBS graph package, where the speedup is about 2x. However, the C++ execution times are much shorter than the Chapel ones, and we hope to find out why Chapel has such a large overhead.

The Chapel comparison in Arkouda is run as follows (a consolidated command sketch follows the steps).

1 Build the Chapel environment, version >= 1.29.0.
2 Download the latest arkouda package.
3 Download the arkouda-njit package (git@github.com:Bears-R-Us/arkouda-njit.git) under the arkouda directory and rename arkouda-njit to arkouda_njit.
4 Compile arkouda-njit: enter the arkouda/arkouda_njit directory and run
python module_configuration.py --ak_loc=/complete/path/to/arkouda/ --pkg_path=/complete/path/to/arkouda_njit/arachne_development/ | bash
5 Under the arkouda directory, start the arkouda server:
./arkouda -nl 1 --ServerConfig.ServerPort=5050 2>&1 | tee -a output.text
6 Run the comparison program under the directory arkouda_njit/arachne_development/test/client/benchmarks:
./runclient.sh batch_cc.py machinename 5050
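For reference, steps 4-6 can be collected into one shell session. This is only a sketch: the /path/to/... locations and machinename are placeholders for your own setup, and the server is backgrounded here so the client can be started from the same shell.

# step 4: build the arkouda-njit modules against your arkouda checkout
cd /path/to/arkouda/arkouda_njit
python module_configuration.py --ak_loc=/path/to/arkouda/ --pkg_path=/path/to/arkouda_njit/arachne_development/ | bash

# step 5: start the arkouda server on a single locale, logging to output.text
cd /path/to/arkouda
./arkouda -nl 1 --ServerConfig.ServerPort=5050 2>&1 | tee -a output.text &

# step 6: run the connected-components benchmark client against port 5050
cd arkouda_njit/arachne_development/test/client/benchmarks
./runclient.sh batch_cc.py machinename 5050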

In batch_cc.py, we can keep only the delaunay_n20.mtx graph and remove all other graphs. We need to download the graph file and change the path in the script to its location.
Here is the output of FastSV (fast sv cc) and our Contour (fs dist cc) in output.text:
2023-06-08:09:39:09 [CCMsg] segCCMsg Line 2404 DEBUG [Chapel] Time elapsed for fast sv cc: 0.144295
2023-06-08:09:39:09 [CCMsg] segCCMsg Line 2418 DEBUG [Chapel] Time elapsed for fs dist cc: 0.017734

The C++ comparison based on the gbbs package is run as follows (a consolidated command sketch follows the steps).

1 Download gbbs:
git clone git@github.com:zhihuidu/gbbs.git
2 Build gbbs:
cd gbbs
git submodule update --init
bazel build //...
3 Generate the converter tool:
cd gbbs/utils
make
./snap_converter -i edgelistfile -o adjfile
Given an edge list file as input, this outputs the adjacency list file for gbbs.
4 Run the test:
numactl -i all ./bazel-bin/benchmarks/Connectivity/ConnectIt/mains/shiloach_vishkin -rounds 3 -s -m -r 10 ./inputs/delaunay_n20.adj
This uses delaunay_n20.adj as the input graph and runs the shiloach_vishkin algorithm.
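For reference, the whole gbbs sequence can be run as the following sketch. The edge list and adj file names (and the inputs/ directory) are placeholders; the submodule step is required, otherwise the parlay headers will be missing at build time.

git clone git@github.com:zhihuidu/gbbs.git
cd gbbs
git submodule update --init      # fetch parlay and the other bundled dependencies
bazel build //...                # build all benchmarks, including the ConnectIt mains

# build the edge-list -> adjacency-list converter and convert the graph
cd utils
make
./snap_converter -i delaunay_n20.edgelist -o ../inputs/delaunay_n20.adj
cd ..

# run Shiloach-Vishkin connectivity on the converted graph
numactl -i all ./bazel-bin/benchmarks/Connectivity/ConnectIt/mains/shiloach_vishkin -rounds 3 -s -m -r 10 ./inputs/delaunay_n20.adj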
The reported times for shiloach_vishkin (median, min, and max over 10 rounds) are:
{
"test_type": "static_connectivity_result",
"test_name" : "shiloach_vishkin; no_sampling",
"graph" : "./inputs/delaunay_n20.adj",
"rounds" : 10,
"medt" : 0.02108,
"mint" : 0.020909,
"maxt" : 0.021629
}

The following command runs our Contour algorithm:
numactl -i all ./bazel-bin/benchmarks/Connectivity/ConnectIt/mains/contour -rounds 3 -s -m -r 10 ./inputs/delaunay_n20.adj

The no_sampling results are as follows.
{
"test_type": "static_connectivity_result",
"test_name" : "contour; no_sampling",
"graph" : "./inputs/delaunay_n20.adj",
"rounds" : 10,
"medt" : 0.013881,
"mint" : 0.013434,
"maxt" : 0.014609
}

The system information is as follows.

uname -a
Linux alex 5.19.0-43-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon May 22 13:39:36 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The last processor entry from cat /proc/cpuinfo is as follows.

processor : 23
vendor_id : GenuineIntel
cpu family : 6
model : 151
model name : 12th Gen Intel(R) Core(TM) i9-12900KS
stepping : 2
microcode : 0x2c
cpu MHz : 3400.000
cache size : 30720 KB
physical id : 0
siblings : 24
core id : 39
cpu cores : 16
apicid : 78
initial apicid : 78
fpu : yes
fpu_exception : yes
cpuid level : 32
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_mode_based_exec tsc_scaling usr_wait_pause
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb
bogomips : 6835.20
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
The output of cat /proc/meminfo is as follows.
MemTotal: 32606972 kB
MemFree: 29801908 kB
MemAvailable: 31092068 kB
Buffers: 101636 kB
Cached: 1503568 kB
SwapCached: 0 kB
Active: 812976 kB
Inactive: 1249784 kB
Active(anon): 2596 kB
Inactive(anon): 454284 kB
Active(file): 810380 kB
Inactive(file): 795500 kB
Unevictable: 32 kB
Mlocked: 32 kB
SwapTotal: 2097148 kB
SwapFree: 2097148 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 4 kB
Writeback: 0 kB
AnonPages: 457856 kB
Mapped: 326920 kB
Shmem: 11356 kB
KReclaimable: 105424 kB
Slab: 405600 kB
SReclaimable: 105424 kB
SUnreclaim: 300176 kB
KernelStack: 13104 kB
PageTables: 16904 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18400632 kB
Committed_AS: 3935208 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 181992 kB
VmallocChunk: 0 kB
Percpu: 26208 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 366260 kB
DirectMap2M: 6709248 kB
DirectMap1G: 26214400 kB

Hi Dr. Du, I have started looking into this and am confused about the data files in batch_cc.py. The file uses paths hardcoded to your home machine, and I believe I need to download those files from somewhere in order to run the benchmark.

Is there somewhere that I can obtain those files?

One other question: could you provide the output from running printchplenv in your environment?

I am glad to know that you are looking into it. Sorry for the confusion in batch_cc.py. You are right: you need to change the path based on your directory. You can download the graph file here:
https://suitesparse-collection-website.herokuapp.com/MM/DIMACS10/delaunay_n20.tar.gz
This is the delaunay_n20 graph file. Of course, you can also use other mtx graph files. However, you then need to: (1) delete all the comments in the original mtx file; (2) remove the first line
1048576 1048576 3145686
which gives the total number of vertices (1048576) and the total number of edges (3145686); and (3) convert the edge list file to an adj file by executing
./snap_converter -i edgelistfile -o adjfile
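A minimal preprocessing sketch for these three steps, assuming the MatrixMarket comment lines start with '%' and the tarball unpacks into a delaunay_n20/ directory (the other file names are placeholders):

# download and unpack the graph
wget https://suitesparse-collection-website.herokuapp.com/MM/DIMACS10/delaunay_n20.tar.gz
tar xzf delaunay_n20.tar.gz

# (1) drop the '%' comment lines and (2) the size line "1048576 1048576 3145686",
# leaving only the edge list
grep -v '^%' delaunay_n20/delaunay_n20.mtx | tail -n +2 > delaunay_n20.edgelist

# (3) convert the edge list to the gbbs adjacency format
./snap_converter -i delaunay_n20.edgelist -o delaunay_n20.adj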

It is quite possible that I missed something in the document, which may prevent you from running it correctly. So please just let me know if you have any problems with this test.

Please see the results of printchplenv

(arkouda-dev) duzh@alex:~/arkouda$ printchplenv
machine info: Linux alex 5.19.0-43-generic #44~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon May 22 13:39:36 UTC 2 x86_64
CHPL_HOME: /home/duzh/chapel-1.29.0 *
script location: /home/duzh/chapel-1.29.0/util/chplenv
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
CHPL_COMM_SUBSTRATE: udp *
CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: qthreads *
CHPL_LAUNCHER: amudprun *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc *
CHPL_ATOMICS: cstdlib
CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled *
CHPL_HWLOC: bundled *
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none

Thanks for the quick response! It looks like you have CHPL_COMM=gasnet set, which is going to cause your single locale (i.e., running with -nl 1) programs to run much slower. Could you rebuild your chpl compiler with CHPL_COMM=none set, recompile Arkouda, and then regather these numbers? I would expect a significant improvement in performance with just that change.

That should look something like:

  1. export CHPL_COMM=none
  2. cd $CHPL_HOME
  3. make -j 8
  4. recompile Arkouda/Arkouda NJIT

Based on the system information above, it doesn't look like you are running on a distributed system, so I would not recommend setting CHPL_COMM=gasnet except for very specific situations (which does not include performance runs). In general, CHPL_COMM should only be set to gasnet when trying to run multi-locale Chapel programs, so you may want to modify your .bashrc or however that variable is being set in your environment to avoid having that set automatically.

One other environment-related thing to check would be to ensure that ARKOUDA_QUICK_COMPILE or ARKOUDA_DEVELOPER are not set, as these will disable optimizations if set when building Arkouda, which would cause much slower performance.
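To make the rebuild steps concrete, here is a sketch of the sequence under those settings; the directory locations are placeholders, and the Arkouda/arkouda-njit rebuild commands are assumed to be the same ones used in the original setup:

# make sure the quick-compile/developer shortcuts are not set when building Arkouda
unset ARKOUDA_QUICK_COMPILE ARKOUDA_DEVELOPER

# rebuild Chapel for single-locale (shared-memory) runs
export CHPL_COMM=none
cd $CHPL_HOME
make -j 8

# rebuild the Arkouda server, then the arkouda-njit modules, against the new runtime
cd /path/to/arkouda
make
cd arkouda_njit
python module_configuration.py --ak_loc=/path/to/arkouda/ --pkg_path=/path/to/arkouda_njit/arachne_development/ | bash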


Thanks and I will recompile it and check the performance.

My computer is overclocked, so I disabled turbo boost and retested. Please see the following results.
For the shared-memory configuration (CHPL_COMM=none):

2023-06-17:11:43:58 [CCMsg] segCCMsg Line 2390 DEBUG [Chapel] Time elapsed for fast sv dist cc: 0.059391
2023-06-17:11:43:58 [CCMsg] segCCMsg Line 2418 DEBUG [Chapel] Time elapsed for fs dist cc: 0.00659

For the multilocale configuration (CHPL_COMM=gasnet):
2023-06-17:09:52:10 [CCMsg] segCCMsg Line 2390 DEBUG [Chapel] Time elapsed for fast sv dist cc: 0.201416
2023-06-17:09:52:11 [CCMsg] segCCMsg Line 2418 DEBUG [Chapel] Time elapsed for fs dist cc: 0.025898

The speedup from switching to CHPL_COMM=none is about 0.025898/0.00659 = 3.93 for Contour (fs dist cc) and 0.201416/0.059391 = 3.39 for FastSV.

Yes, just as you predicted, the performance improved significantly. Thanks for letting me know about this.

Great, glad to hear that helped out. So, what is the Chapel vs C++ comparison at this point? Is Chapel still behind C++? I have still not been able to run this myself; it seems that the batch_cc.py file requires many files, not just the one you provided download instructions for, and I am also unsure what the snap_converter program is that you are executing; I wasn't able to easily figure that out with a Google search.

Are you running your test with all of the lines deleted except for the [3145686,1048576,2,0,HomeDir+"Adata/Delaunay/delaunay_n20/delaunay_n20.mtx.pr"] line, or are you running with all of the files? Can you provide a link for the snap_converter program?

The gbbs package performance in C++ for the same input graph is as follows.
{
"test_type": "static_connectivity_result",
"test_name" : "contour; no_sampling",
"graph" : "./inputs/graphs/edge_streams_coo/delaunay_n20.adj",
"rounds" : 10,
"medt" : 0.017801,
"mint" : 0.01715,
"maxt" : 0.019555
}
The reported time is 0.017801 s. Although Chapel is a little slower than gbbs here, it is very close to the C++ performance.
Yes, batch_cc.py will test many input graphs. You can keep just the ones you want to test. Currently, I only test delaunay_n20 to check the performance results.

Here is the converter program: gbbs/utils/snap_converter.cc in the zhihuidu/gbbs repository on GitHub.
You just need to run
make
under the utils directory to generate the program.

Please let me know if you have any questions about the test. Thanks!

Though you seemed reasonably satisfied with where this stands after changing to CHPL_COMM=none, I decided to put a bit more time into it to see whether I could reproduce your results. I'm about to time out on this effort, but wanted to capture what I was and was not able to accomplish:

  1. I was not able to build the C++ code, seeing errors related to pathing (I believe paths relative to your machine, such as bridge.h:9:10: fatal error: parlay/delayed_sequence.h: No such file or directory)
  2. I was not able to preprocess the data using snap_converter due to similar build issues
  3. I was able to run the Chapel code using only the delaunay_n20.mtx file, though I'm unsure whether the output is correct, given the lack of preprocessing
  4. I was unable to determine if I was running in the same way because I had to modify the scripts in order to get things to run

So, I have some Chapel numbers, but I am unsure if they were collected correctly and am also unsure how to interpret the data and am unable to collect corresponding C++ data to compare against.

1 About the C++ package compile error: please check whether you executed this step
git submodule update --init
After you git clone the gbbs package, you need to enter the gbbs directory and run the above git command to download the parlay package. I could not compile it either at first, and then I found that I had missed this step.

2 I guess the snap_converter build failure has the same cause.

3 If you use the mtx file directly, please replace the call to
njit.graph_file_read
with
njit.graph_file_read_mtx
I preprocess the mtx files so that they contain only the edge list; you need to use graph_file_read_mtx if you use the original mtx input file directly.

4 I think it's ok to change the script.

Hi Ben, please see the latest test results below. After following your suggestion, I now get much better results.
I cannot upload the file, so I have pasted the results here for your reference.

Graph Name | Graph ID | Number of Edges | Number of Vertices | Chapel time (s) | gbbs C++ time (s) | Slowdown (Chapel/gbbs)
ca-GrQc 0 28980 5242 0.000264 0.000324 0.814814815
ca-HepTh 1 51971 9877 0.000323 0.000397 0.813602015
facebook_combined 2 88234 4039 0.000243 5.60E-05 4.339285714
wiki 3 103689 8277 0.000339 0.000175 1.937142857
as-caida20071105 4 106762 26475 0.000536 0.00015 3.573333333
ca-CondMat 5 186936 23133 0.000583 0.000658 0.886018237
ca-HepPh 6 237010 12008 0.000567 0.0006145 0.922701383
email-Enron 7 367662 36692 0.000651 0.0006645 0.979683973
ca-AstroPh 8 396160 18772 0.000534 0.0007255 0.736044108
loc-brightkite_edges 9 428156 58228 0.000975 0.00056 1.741071429
soc-Epinions1 10 508837 75879 0.001538 0.000789 1.949302915
com-dblp 11 1049866 317080 0.008095 0.0014275 5.670753065
com-youtube 12 2987624 1134890 0.025637 0.0035885 7.144210673
amazon0601 13 2443408 403394 0.008375 0.003014 2.778699403
soc-LiveJournal1 14 68993773 4847571 0.20743 0.040581 5.111505384
higgs-social_network 15 14855842 456626 0.024719 0.004915 5.029298067
com-orkut 16 117185083 3072441 0.262083 0.027512 9.526134051
road_usa 17 28854312 23947347 0.193439 0.076205 2.538402992
kmer_A2a 18 180292586 170728175 2.09394 0.54448 3.845761093
kmer_V1r 19 232705452 214005017 2.7466 0.61255 4.483878867
uk-2002 20 298113762 18520486 0.369062 0.56479 0.653449955
delaunay_n10 21 3056 1024 0.000109 8.90E-05 1.224719101
delaunay_n11 22 6127 2048 0.000109 0.000109 1
delaunay_n12 23 12264 4096 0.000172 0.000123 1.398373984
delaunay_n13 24 24547 8192 0.000202 0.000137 1.474452555
delaunay_n14 25 49122 16384 0.000287 0.000184 1.559782609
delaunay_n15 26 98274 32768 0.000462 0.0002745 1.683060109
delaunay_n16 27 196575 65536 0.000876 0.000438 2
delaunay_n17 28 393176 131072 0.001228 0.000639 1.921752739
delaunay_n18 29 786396 262144 0.00245 0.000929 2.637244349
delaunay_n19 30 1572823 524288 0.003442 0.0018515 1.859033216
delaunay_n20 31 3145686 1048576 0.00864 0.0034375 2.513454545
delaunay_n21 32 6291408 2097152 0.014191 0.006838 2.075314419
delaunay_n22 33 12582869 4194304 0.033228 0.01365 2.434285714
delaunay_n23 34 25165784 8388608 0.065764 0.027456 2.395250583
delaunay_n24 35 50331601 16777216 0.142013 0.056348 2.52028466
Average slowdown (Chapel/gbbs) = 2.615891748