For library stabilization / Chapel 2.0, we're looking to improve and stabilize our support for barriers. As part of this effort we're looking to combine the [Barriers](https://chapel-lang.org/docs/1.25/modules/standard/Barriers.html) module and the [AllLocalesBarriers](https://chapel-lang.org/docs/1.25/modules/packages/AllLocalesBarriers.html) module, the latter of which was introduced because of performance limitations in the original barriers module.
This is a meta-issue, where I want to go through the API, discuss the current limitations (primarily focused on performance issues), show some motivating use-cases, present some options for addressing performance limitations, and give some comparisons to other languages/libraries. Specific issues and interface choices will be split into their own issues, but I wanted something to provide the big picture. For internal devs, see also https://github.com/Cray/chapel-private/issues/1987 and https://github.com/Cray/chapel-private/issues/2693.
Barrier API:
---
The barrier API we have today is relatively simple, but it has some performance implications. The barrier docs have a decent introduction and simple example -- https://chapel-lang.org/docs/1.25/modules/standard/Barriers.html#barriers
Once you have a barrier object, the interface is pretty simple. You can either do a single-phase `barrier()` call or you can do a split-phase barrier with `wait()` and `notify()`. These are pretty standard and the wait/notify is simply an optimization to allow you to do some work while waiting for other tasks to arrive.
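For example, the split-phase flavor lets a task overlap its own work with waiting. Here's a hedged sketch, where `finishPhase` and `doOverlappableWork` are hypothetical placeholders for work that others depend on and work that doesn't depend on other tasks, respectively:

```chpl
use Barriers;

const nTasks = here.maxTaskPar;
var b = new Barrier(nTasks);
coforall tid in 0..#nTasks {
  finishPhase(tid);        // work other tasks depend on
  b.notify();              // signal arrival without blocking
  doOverlappableWork(tid); // independent work, overlapped with waiting
  b.wait();                // block until all tasks have notified
}
```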
We also have `check` and `reset` calls. `check` just lets you ask whether the barrier is complete, and `reset` is a way to change the number of tasks participating in the barrier. Arguably, instead of doing a `reset` you could just destroy the barrier and create a new one, but that's potentially expensive, so I think `reset` is also an optimization of sorts. I'm not sure the name is particularly intuitive, though. I don't personally know of any use cases for `check`, but you could imagine combining it with a split-phase barrier to see if you should keep doing some work on your own or if others have completed the barrier. I'm not sure if `check` has any implications for potential implementations (i.e., does it force us into specific implementations that wouldn't be needed if we didn't support it?).
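As a hedged sketch of that `check` idea, given a barrier `b` shared by the tasks and a hypothetical `refineMyResult` standing in for optional local work:

```chpl
b.notify();           // register our arrival
while !b.check() do   // barrier not complete yet?
  refineMyResult();   // keep doing optional local work in the meantime
b.wait();             // everyone has arrived; finish the barrier
```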
Beyond the main barrier API, we also have the all-locales barrier, which was added because the default barrier does not scale. The all-locales barrier hooks into the runtime `chpl_comm_barrier()`. Hooking into the comm layer allows us to specialize for each network, and I believe this is an important concept for a performant barrier. Since this is tied to the runtime and not something a user could do, I feel pretty strongly that the barrier is something we have to provide as part of the standard module (and not something a user could really provide themselves). In terms of API -- the all-locales barrier only supports the `reset` and `barrier` interface. It does not support `wait`/`notify`/`check`. I'm not positive, but I think adding wait/notify would be pretty straightforward in terms of the comm layer implementation; I'm less sure about check. In its current form the all-locales barrier is a global singleton, which means it's really only suitable for use in SPMD codes.
Init API:
---
The main performance limiting issue in the barrier API is actually its initializer:
```chpl
proc init(numTasks: int, barrierType: BarrierType = BarrierType.Atomic, reusable: bool = true)
```
Because only a number of tasks is specified, this really constrains the implementation to a global counter that lives on a single locale, whereas what you really want in distributed contexts is some sort of tree-based or dissemination barrier. If the initializer included which locales and how many tasks per locale will participate, we could provide a pretty optimized implementation. Expanding on that a little further, just knowing how many tasks per locale participate still limits you to some sort of shared counter within a node; you could optimize further if, for instance, each task had a unique ID.
Another interesting aspect of this initializer is that you can specify a `BarrierType`, which tells us whether we're going to use atomics or syncs to implement the barrier. The reason we care is that atomics will spin-wait and will likely have lower latency but use more CPU, while syncs will have higher latency but use less CPU.
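For example, with today's API that trade-off is selected at construction time:

```chpl
use Barriers;

// Atomic (default): spin-waits; lower latency, burns CPU while waiting
var spinBar = new Barrier(here.maxTaskPar, BarrierType.Atomic);

// Sync: blocks waiting tasks; higher latency, frees up the CPU
var blockBar = new Barrier(here.maxTaskPar, BarrierType.Sync);
```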
I think keeping this concept about active waiting vs blocking is important in the longer term, but I'm not sure we should make the enum about the implementation details. I wonder if this could instead be a "waiting policy" enum or something more generic that describes the intent rather than the implementation. If we keep an enum of sorts to control this behavior, I think we should make it param. It being non-param currently requires dynamic dispatch to select the implementation and I think we need barriers to have as little overhead as possible. I also wonder whether we need this control for the initial stabilization effort. It would seem easiest to get rid of the waiting policy for now, stabilize, and then we can add it back in later as we need it and have time to optimize.
The last argument is whether the barrier is reusable. If you don't need a reusable barrier, we can optimize some work out. It's not obvious to me if we should split this out into a different type or just keep the param. Other languages (C++, Java) have different objects called latches to implement this non-reusable barrier.
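For reference, here's what the one-shot (latch-like) behavior looks like with the current param, using hypothetical `initMyState`/`runMyWork` placeholders:

```chpl
use Barriers;

const nTasks = here.maxTaskPar;
// A non-reusable barrier never needs to be re-armed for a later phase,
// which may allow some bookkeeping to be optimized out
var oneShot = new Barrier(nTasks, reusable=false);
coforall tid in 0..#nTasks {
  initMyState(tid);  // per-task setup
  oneShot.barrier(); // everyone waits for setup, exactly once
  runMyWork(tid);
}
```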
Module API:
---
The module is named `Barriers`, which always strikes me as a bit odd. This was done because of https://github.com/chapel-lang/chapel/issues/6979, where the module used to be `Barrier` and we had a class named `Barrier`. Since then, we replaced the class with a record-wrapped class, but the record is still named `Barrier`. I think we've typically been going with lower case for records, so this is an exception to that rule.
Could we go back to a `Barrier` module with a `barrier` record at this point? Would it make more sense to put this into a more general `Collectives` module? I will note that a lot of codes just want barriers with no other collectives so to me it seems attractive to have a standalone module for the barrier implementation, but on the other hand this could just be a `from Collectives import barrier;` sort of import.
Motivating examples:
---
Here's some pseudocode for different barrier idioms derived from LCALS, NPB-FT, ISx (the original motivation for allLocalesBarrier), CHAMPS, and SSCA#2.
<details>
Local multi-task barrier (based on [LCALS SPMD TRAP_INT](https://github.com/chapel-lang/chapel/blob/ef67d827bdeed9258f52e5d3ef66729761cee636/test/release/examples/benchmarks/lcals/RunSPMDRawLoops.chpl#L568-L605))
```chpl
use Barriers;
const nTasks = here.maxTaskPar;
var bar = new Barrier(nTasks);
coforall tid in 0..#nTasks {
  while notConverged() {
    computeMyChunk(tid);
    bar.barrier();
    if tid == 0 then updateConvergInfo();
    bar.barrier();
  }
}
```
---
Distributed single-task singleton barrier (based on [NPB-FT](https://github.com/chapel-lang/chapel/blob/ef67d827bdeed9258f52e5d3ef66729761cee636/test/npb/ft/npadmana/DistributedFFT.chpl#L261)):
```chpl
use AllLocalesBarriers;
coforall loc in Locales do on loc {
  forall iy in ySrc {
    zPlan.execute(myplane[0, iy, zSrc.first]);
    Dst[{iy..iy, ix..ix, zSrc}] = myplane[{0..0, iy..iy, zSrc}];
  }
  allLocalesBarrier.barrier();
  forall (iy, iz) in {yDst, zSrc} {
    xPlan.execute(Dst[iy, xDst.first, iz]);
  }
}
```
---
Distributed equal multi-task singleton barrier (based on optimized [ISx](https://github.com/chapel-lang/chapel/blob/ef67d827bdeed9258f52e5d3ef66729761cee636/test/studies/isx/isx-hand-optimized.chpl#L157)):
```chpl
use AllLocalesBarriers;
allLocalesBarrier.reset(perBucketMultiply);
coforall loc in Locales do on loc {
  coforall tid in 0..#perBucketMultiply {
    exchangeKeys(tid, sendOffsets, bucketSizes, myBucketedKeys);
    allLocalesBarrier.barrier();
    countLocalKeys(tid, keysInMyBucket, myLocalKeyCounts);
  }
}
```
---
Distributed non-equal multi-task singleton barrier (based on CHAMPS):
```chpl
use CustomAllLocalesBarrier;
var gridZones = newBlockArr(0..#n, zoneType);
var tasksPerLocale: [LocaleSpace] int;
coforall loc in Locales do on loc {
  tasksPerLocale[loc.id] = gridZones.localSubdomain().size;
}
customAllLocalesBarrier.reset(tasksPerLocale);
// ...
```
---
Distributed single-task non-singleton barrier (based on [SSCA#2](https://github.com/chapel-lang/chapel/blob/ef67d827bdeed9258f52e5d3ef66729761cee636/test/release/examples/benchmarks/ssca2/SSCA2_Modules/SSCA2_kernels.chpl#L276)):
```chpl
use Barriers;
forall v in vertices do on vertex_domain.dist.idxToLocale(v) {
  var bar = new Barrier(numLocales);
  coforall loc in Locales do on loc {
    while remaining() {
      forall ... { }
      bar.notify();
      localWork();
      bar.wait();
      forall ... { }
    }
  }
}
```
</details>
Potential API:
---
Below are my thoughts on an interface that would allow a more performant implementation. I think we still want to support a simple interface for shared memory barriers so we should keep the numTasks interface, but add an additional interface that takes an array of ints that specifies which locales and how many tasks per locale will participate:
```chpl
/* Like the current API, not expected to be scalable, but simple to use for
* local barriers */
proc init(numTasks: int, ...) { }
/* Take an array of ints to specify how many tasks will be participating per
* locale. 0 means exclude that locale */
proc init(participants: [] int, ...) { }
```
Or, instead of two initializers, should these be different types, like `LocalBarrier` vs `DistributedBarrier`? This still limits the within-node barrier performance, since that's still just a count, which necessitates a global-counter implementation. I think this can be improved with `barrier`, and probably `wait`/`notify`, overloads that take an id.
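To make that concrete, here's a strawman of what id-taking overloads might look like (not an existing API, just a sketch):

```chpl
/* Strawman: each task passes a unique id in 0..#numTasks, which would
 * let the implementation use per-task slots (e.g., a tree or
 * dissemination pattern) instead of a contended shared counter */
proc barrier(taskId: int) { }
proc notify(taskId: int) { }
proc wait(taskId: int) { }
```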
Other languages:
---
I did some comparisons with other languages and programming models. Generally speaking this doesn't give us an apples-to-apples comparison because they're either shared-memory only (C++, Java) or SPMD (UPC/MPI), so their concerns and design choices are different.
C++ and Java provide non-reusable barriers they call a latch:
- https://en.cppreference.com/w/cpp/thread/latch
- https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CountDownLatch.html
The reusable barriers are called `barrier` and `CyclicBarrier` (where cyclic just denotes reusable):
- https://en.cppreference.com/w/cpp/thread/barrier
- https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CyclicBarrier.html
MPI and UPC just have function calls; there's no barrier object to create.
The APIs are pretty similar to the one we provide, though there are some differences especially for method names. Here's a function name comparison across languages:
| Chapel | C++ | Java | UPC | MPI |
| ------- | --------------- | ---------------- | ----------- | ------------ |
| barrier | arrive_and_wait | await | upc_barrier | MPI_Barrier |
| notify | arrive | N/A | upc_notify | MPI_Ibarrier |
| wait | wait | N/A | upc_wait | MPI_Wait |
| check | N/A | getNumberWaiting | N/A | MPI_Test |
| reset | N/A | reset | N/A | N/A |
Another interesting difference is that both C++ and Java provide first-class-function-like behavior where you can execute something a single time once all threads hit the barrier. This is the `CompletionFunction` in C++ and `Runnable` in Java. I think this can reduce the number of barrier calls needed in some cases. For example, with the LCALS kernel:
```chpl
const nTasks = here.maxTaskPar;
var bar = new Barrier(nTasks);
coforall tid in 0..#nTasks {
  while notConverged() {
    computeMyChunk(tid);
    bar.barrier();
    if tid == 0 then updateConvergInfo();
    bar.barrier();
  }
}
```
I think you can write this as something like:
```chpl
const nTasks = here.maxTaskPar;
var bar = new Barrier(nTasks, CompletionFunction=lambda() { updateConvergInfo() });
coforall tid in 0..#nTasks {
  while notConverged() {
    computeMyChunk(tid);
    bar.barrier();
  }
}
```
And this would also be useful for internal usage, since this is effectively how we implement the `allLocalesBarrier` today: each node has a local barrier, and the last task to enter it calls `chpl_comm_barrier()`.
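A rough pseudocode sketch of that hierarchical structure, assuming a hypothetical completion-function initializer argument (and glossing over the extern declaration the runtime call would need):

```chpl
// Per-locale local barrier whose completion action runs the network
// barrier, so only the last task on each locale pays the comm-layer cost
var localBar = new Barrier(tasksOnThisLocale,
                           completionFunction = lambda() {
                             chpl_comm_barrier();  // runtime hook
                           });
```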
Java also has a much more involved "Phaser" concept:
- https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Phaser.html.
I had not heard of this until doing research for this review. Here are some preliminary resources I found about them:
- overview of 3 barrier types in java including phasers -- http://www.dre.vanderbilt.edu/~schmidt/cs891s/2020-PDFs/10.2.2-barrier-synchronization-pt2-overview-of-java-barrier-synchronizers.pdf
- slides about adding phasers to OpenMP -- http://www.ncsa.illinois.edu/Conferences/IWOMP11/program/presentations/shirako.pdf
- paper about phasers in openmp (haven't found link to full paper) -- https://link.springer.com/chapter/10.1007/978-3-642-21487-5_10
- paper about phasers from rice (Vivek was a coauthor) -- https://www.cs.rice.edu/~vs3/PDF/SPSS08-phasers.pdf