19860, "Guillaume-Helbecque", "DistributedBag module: inter-locale work stealing mechanism", "2022-05-22T18:37:18Z"
Summary of Problem
In theory, when a distributed bag is declared, each locale has its own bag instance, and a dynamic inter-locale work stealing mechanism is supposed to kick in when a locale's bag instance is empty.
Consider 2 locales for the description of the issue (the same behavior occurs with any number of locales):
When locale 0 has a non-empty bag instance and locale 1 has an empty bag instance, locale 1 is not able to steal work from locale 0. This happens both when locale 1 starts the execution with an empty bag instance and when its bag instance becomes empty during execution.
if !targetBag!.loadBalanceInProgress.read() { ... }
This condition is checked in REMOVE_WORST_CASE, where we attempt to steal work from another locale. The two other cases, REMOVE_BEST_CASE and REMOVE_AVERAGE_CASE, seem to work fine.
Steps to Reproduce
Source Code:
use DistributedBag;

config const n = 2000;

var bag = new DistBag(int, targetLocales=Locales); // creation of the bag
bag.addBulk(1..n); // insertion of elements (performed from locale 0)
writeln("Initial bag size: ", bag.getSize(), "\n");

coforall loc in Locales do on loc do {
  var counter: atomic int = 0; // per-locale counter of removed elements
  coforall tid in 0..#here.maxTaskPar do {
    while true {
      // remove() returns (true, element) on success, (false, default value) otherwise
      var (hasElt, x): (bool, int) = bag.remove(); // we try to remove an element
      if hasElt { // if we successfully removed an element
        counter.add(1); // update the counter
      }
      if bag.getSize() == 0 { // if the global bag is empty
        break;
      }
    }
  } // end coforall tasks
  writeln(counter.read(), " element(s) removed from ", here);
} // end coforall locales
Compile command:
chpl foo.chpl -o foo.o --fast
Execution command:
./foo.o -nl 2
Output:
Initial bag size: 2000
0 element(s) removed from LOCALE1
2000 element(s) removed from LOCALE0
Due to the dynamic inter-locale work stealing mechanism, we expected the locales to remove approximately the same number of elements.
Configuration Information
Output of chpl --version: 1.26.0
Output of $CHPL_HOME/util/printchplenv --anonymize: