These are preliminary results from running a few benchmarks on a dual-rail InfiniBand (IB) system. Each data point was collected on a two-node cluster in which each node contains two AMD Genoa processors; all runs used CHPL_GASNET_SEGMENT=fast.
| Column | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Co-locales | 1 | 1 | 1s[^1] | 1s[^1] | 2 | 2 | 2 | 2 | 2 | 2 |
| NICs | 1 | 2 | 1 | 2 | 1 | 2 | 2s[^2] | 1 | 2 | 2s[^2] |
| PSHM | No | No | No | No | No | No | No | Yes | Yes | Yes |
Raw bandwidth experiments (test/performance/comm/low-level/simpleBandwidth.chpl):
| NICs | PUT (GB/s) | GET (GB/s) |
|---|---|---|
| 1 | 25 | 39 |
| 2 | 46 | 75 |
The PUTs are significantly slower than the GETs (not sure why), so we would expect indexgather to go no faster than the PUT numbers.
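For context, a minimal bulk-transfer sketch in Chapel is shown below. This is not the actual simpleBandwidth.chpl; the transfer size, the Buf wrapper class, and the choice of target locale are illustrative:

```chapel
use Time;

config const n = 32 * 1024 * 1024;   // 8-byte ints per transfer (~256 MiB; illustrative)

class Buf { var data: [0..#n] int; }

// A buffer on Locale 0 and one allocated on the last locale.
var localBuf: [0..#n] int;
var remoteBuf: owned Buf?;
on Locales[numLocales-1] do remoteBuf = new Buf();

var t: stopwatch;

// PUT: Locale 0 writes the remote buffer in one bulk transfer.
t.start();
remoteBuf!.data = localBuf;
t.stop();
writeln("PUT: ", (n * 8.0 / 1e9) / t.elapsed(), " GB/s");

// GET: Locale 0 reads the remote buffer back in one bulk transfer.
t.clear();
t.start();
localBuf = remoteBuf!.data;
t.stop();
writeln("GET: ", (n * 8.0 / 1e9) / t.elapsed(), " GB/s");
```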
Discussion
It's difficult to make sense of these numbers. The stream results seem reasonable -- the benchmark does not do any significant communication, so the number of NICs does not affect performance. There is a 25% performance improvement going from one to two co-locales per node, due to the benefit of not crossing the socket boundary.
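For reference, the heart of a stream-style triad over a Block-distributed array looks roughly like the sketch below (the sizes and names are illustrative, not the exact stream.chpl source); each locale computes only on the elements it owns, which is why the NIC count should not matter:

```chapel
use BlockDist;

config const m = 100_000_000;   // elements per vector (illustrative)
config const alpha = 3.0;

// Triad kernel over Block-distributed vectors: every locale (or co-locale)
// updates only its local chunk, so essentially no communication is needed.
const Dom = blockDist.createDomain(1..m);
var A, B, C: [Dom] real;

B = 1.0;
C = 2.0;

forall (a, b, c) in zip(A, B, C) do
  a = b + alpha * c;
```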
The indexgather numbers are more mystifying. In the benchmark, Locale 0 copies random entries from a block-distributed array into a local array. Source aggregation is used, meaning Locale 0 aggregates the indices of the entries it wants from each remote locale, sends the aggregated indices to each remote locale, receives the values from each remote locale in a single response, and disperses the values into the temporary array at the proper locations. Thus, assuming the indices and values are the same width (need to check on this), Locale 0 is doing PUTs and GETs of roughly the same amount of data to each of the remote locales. There is a limit to the aggregation buffer size, so once the initial aggregation buffer is full the PUTs and GETs will be overlapped.
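The sketch below is a simplified illustration of that source-aggregation scheme, not the actual aggregator implementation; the table size, request count, index scramble, and one-flush-per-locale structure are illustrative assumptions:

```chapel
use BlockDist, List;

config const tableSize = 1_000_000;   // distributed table size (illustrative)
config const numReqs   = 100_000;     // elements gathered by Locale 0 (illustrative)

// Block-distributed source table.
const D = blockDist.createDomain(0..#tableSize);
var Table: [D] int = [i in D] 2*i;

// Pseudo-random indices Locale 0 wants, and the local buffer it fills.
var idx = [p in 0..#numReqs] (p * 2654435761) % tableSize;   // stand-in for random indices
var dst: [0..#numReqs] int;

// Bucket each request by the locale that owns its table entry.
var bucket: [0..#numLocales] list(int);
for p in 0..#numReqs do
  bucket[Table[idx[p]].locale.id].pushBack(p);

// One round trip per locale: ship the aggregated indices, get the values back
// in a single reply, then disperse them.  (The real aggregator bounds its
// buffers and flushes them as they fill, which is what overlaps the PUTs and
// GETs; this sketch does a single flush per locale.)
for loc in Locales {
  const positions = bucket[loc.id].toArray();   // where the results land in 'dst'
  const tblIdx = [p in positions] idx[p];       // the aggregated-index message
  var vals: [0..#positions.size] int;           // staging buffer on Locale 0

  on loc {
    const myIdx = tblIdx;                       // receive the indices in one bulk transfer
    vals = [i in myIdx] Table[i];               // gather local values, one bulk reply
  }

  for (p, v) in zip(positions, vals) do         // disperse values to their proper locations
    dst[p] = v;
}
```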
Looking at the columns from left to right, with one locale per node, performance increases from 34 GB/s to 64 GB/s going from one to two NICs. This is lower than the raw PUT performance, presumably due to various overheads in indexgather, including the computation and copying needed to create the aggregated indices and to disperse the values when they are received. Note that confining the locale to a single socket does not decrease performance with a single NIC (columns 1 vs. 3), further implying that the NIC is the bottleneck. Performance does not improve with a second NIC (column 4), which is a surprise. It stays flat at 34 GB/s, so either something is wrong with the experiment or the single-NIC bandwidth is coincidentally the same as the single-socket bandwidth, which seems unlikely.
Performance with two co-locales sharing a NIC is 50 GB/s (column 5), which corresponds to the raw bandwidth for a single NIC. However, performance for two co-locales each with its own NIC increases to only 54 GB/s (column 6). I would expect it to be closer to twice the single-socket single-NIC performance (i.e., 68 GB/s). Performance drops slightly to 50 GB/s if the two co-locales share the two NICs (column 7).
PSHM performance is also unexpected (columns 8-10). With two co-locales on each of two nodes, I would expect 1/3 of Locale 0's traffic to target the co-locale on the same node (of the three other locales Locale 0 communicates with, one shares its node) and thus go through PSHM. That should result in a significant performance increase; instead, performance is slightly worse than in the non-PSHM configurations.
The bottom row (ra-rmo) contains performance numbers for a benchmark that does random read-modify-writes to a distributed array of 8-byte integers. Performance with a single locale per node is relatively flat, with a slight increase when the locale is confined to a single socket (columns 1-4). Performance with two co-locales per node is flat (columns 5-8). PSHM performance has yet to be measured.
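For reference, the core of such a random read-modify-write benchmark looks roughly like the sketch below; the table size, update count, and the pseudo-random index stream are illustrative, and the real benchmark's handling of colliding updates is omitted:

```chapel
use BlockDist;

config const tableSize  = 1 << 25;         // 8-byte ints in the distributed table (illustrative)
config const numUpdates = 4 * tableSize;   // number of read-modify-writes (illustrative)

// Block-distributed table of 8-byte integers.
const TDom = blockDist.createDomain(0..#tableSize);
var T: [TDom] int;

// Each task updates pseudo-random locations in place, so most updates are a
// remote GET followed by a remote PUT.  Concurrent updates may collide; the
// real benchmark deals with that separately, and this sketch ignores it.
forall u in 0..#numUpdates {
  const r = u * 25214903917 + 11;          // cheap stand-in for the RA random stream
  T[r % tableSize] ^= r;
}
```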