[Chapel Merge] Simplify locale setup barrier

Branch: refs/heads/main
Revision: f121ce2
Author: ronawho
Link: Simplify locale setup barrier by ronawho · Pull Request #18868 · chapel-lang/chapel · GitHub
Log Message:

Merge pull request #18868 from ronawho/simplify-locale-barrier

Simplify locale setup barrier

[reviewed by @bradcray @daviditen @e-kayrakli]

We need a barrier while setting up locale data structures. Previously,
we did this with a custom barrier added in 340ce913d6. That barrier was
not highly optimized, partly because it is called early in startup when
not all features are available, and we couldn't just use
chpl_comm_barrier since it's already in use. The barrier was also racy
because it PUT wide class pointers to locale 0 while locale 0 was
reading them to see if the classes were non-nil. I don't know that this
has bitten us in the past, but it wasn't ideal.

Instead of trying to free up chpl_comm_barrier() (which would be nice
but has not been easy) or trying to improve the existing barrier, just
change the callsite to use 2 coforall+ons and rely on the implicit
barrier at join. This adds an extra coforall+on, but that pattern is
already highly optimized since it's used all over. A comm barrier may
ultimately be a little more efficient, but this approach is much faster
at least on platforms that don't support network atomics. Here are
512-node XC timings for helpSetupRootLocaleFlat():

Config                  Before  After
--network-atomics=ugni  0.09s   0.09s
--network-atomics=none  4.13s   0.09s
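
For readers unfamiliar with the pattern, the "2 coforall+ons with an
implicit barrier at join" idea can be sketched roughly as follows. This
is an illustrative user-level example only, not the code in
ChapelLocale.chpl: the `ready` array and the work done in each phase are
hypothetical, and the real setup code runs before `Locales` is fully
available.

```chapel
// Hypothetical per-locale flags; the array itself lives on locale 0.
var ready: [LocaleSpace] bool;

// Phase 1: one task per locale, each running on its own locale.
coforall loc in Locales do on loc {
  ready[here.id] = true;   // stand-in for per-locale setup work
}
// A coforall does not return until all of its tasks have finished,
// so reaching this point means phase 1 is complete on every locale.

// Phase 2: each locale can now rely on phase 1 being done everywhere.
coforall loc in Locales do on loc {
  assert(&& reduce ready);
}
```

The join at the end of the first coforall plays the role the custom
barrier used to play, at the cost of launching a second set of remote
tasks.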

I'm actually quite surprised how slow the processor atomic version was
before. I suspect this is mostly due to the testAndSet used to release
remote nodes being serial and blocking (and creating a remote task since
it's fetching), but I'm still surprised by just how much overhead that
adds. We could have changed the testAndSet to a write, since the
result isn't used, or performed the updates in parallel, but I don't
expect either of those to be any faster. This new version seems plenty
fast and gets rid of some crufty and racy code.

Modified Files:
R test/modules/sungeun/init/printInitCommCounts.lm-numa.good

M modules/internal/ChapelLocale.chpl
M test/functions/operatorOverloads/operatorMethods/genericsInstantiationBad.good
M test/modules/sungeun/init/printInitCommCounts.good
M test/visibility/except/operatorsExceptions/exceptGreaterThan.good
M test/visibility/except/operatorsExceptions/exceptGreaterThanOrEqual.good
M test/visibility/except/operatorsExceptions/exceptLessThan.good
M test/visibility/except/operatorsExceptions/exceptLessThanOrEqual.good

Compare: https://github.com/chapel-lang/chapel/compare/5430b6ac1f00...f121ce23f05a