Branch: refs/heads/main
Revision: 1285226
Author: ronawho
Log Message:
Merge pull request #17978 from ronawho/opt-allLocalesBarrier
Optimize allLocalesBarrier by reducing communication
[reviewed by @e-kayrakli]
The allLocalesBarrier was previously implemented using a distributed
field in a class. This was hitting the performance issue in #10160 where
any access of the distributed field was doing a GET of the class
instance first, which resulted in all remote tasks doing a GET from
locale 0. This really hurt performance, especially on InfiniBand systems
where communication injection is serialized.
This moves the distributed field out of the class into a global, which
for this case is fine since the allLocalesBarrier is a singleton global
anyways. This significantly improves the performance of the barrier by
eliminating needless communication. A comm test is added to lock in that
behavior too.
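As a rough illustration of the all-to-one pattern described above, here is a toy model in Python. It is not Chapel and does not reflect the actual runtime or the real barrier implementation. It assumes one task per locale and one access of the distributed field per barrier, and it counts only the extra GETs of the class instance that target locale 0. The function name and parameters are hypothetical.

```python
def extra_gets_to_locale0(num_locales: int, field_in_class: bool) -> int:
    """Toy count of class-instance GETs sent to locale 0 for one barrier.

    Assumption (not from the runtime): one task per locale, and each task
    touches the distributed field exactly once per barrier.
    """
    if not field_in_class:
        # Field is a global: no class instance has to be fetched first.
        return 0
    # Field lives in a class on locale 0: every other locale GETs the instance.
    return num_locales - 1


for n in (16, 512):
    before = extra_gets_to_locale0(n, field_in_class=True)
    after = extra_gets_to_locale0(n, field_in_class=False)
    print(f"{n} locales: {before} extra GETs before, {after} after")
```

Under these assumptions the extra traffic aimed at locale 0 grows linearly with the number of locales, which is consistent with the bottleneck getting worse at scale.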
Comparing performance for performance/comm/barrier/empty-chpl-barrier
with 100,000 trials, we see significant improvements at small scale for
InfiniBand and non-trivial improvements for Aries. The following results
are on 16 nodes with 40 cores each:
config   Aries   IB
before   3.6s    61.4s
after    2.8s    4.4s
And on 512 nodes of a different Aries system with 36 cores per node:
config   Aries
before   89.2s
after    5.1s
So at small scale we see large benefits for InfiniBand, and on Aries at
large scale we also see big improvements. There are 2 factors here: on
IB, communication is serialized, which makes the impact more dramatic,
and even on Aries, which has fast concurrent comm, the all-to-one
behavior becomes a bottleneck at scale.
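For a quick sense of the magnitudes, the speedups implied by the timings above can be worked out with a few lines of Python. The timings are the ones reported in this message. The labels and the helper name are only illustrative.

```python
# (before, after) wall-clock times in seconds, as reported above.
timings = {
    "Aries, 16 nodes": (3.6, 2.8),
    "IB, 16 nodes": (61.4, 4.4),
    "Aries, 512 nodes": (89.2, 5.1),
}


def speedup(before: float, after: float) -> float:
    """Ratio of the old time to the new time (higher is better)."""
    return before / after


for name, (before, after) in timings.items():
    print(f"{name}: {speedup(before, after):.1f}x faster")
```

This works out to roughly 1.3x on Aries and 14x on InfiniBand at 16 nodes, and about 17.5x on Aries at 512 nodes.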
Modified Files:
M modules/packages/AllLocalesBarriers.chpl
M modules/standard/Barriers.chpl
M test/parallel/taskPar/sungeun/barrier/commDiags.chpl
M test/parallel/taskPar/sungeun/barrier/commDiags.comm-none.good
M test/parallel/taskPar/sungeun/barrier/commDiags.good
M test/parallel/taskPar/sungeun/barrier/commDiags.na-none.good
Compare: https://github.com/chapel-lang/chapel/compare/75a0eeea870e...128522633ad6