External Issue: DistributedBag module: inter-locale work stealing mechanism

19860, "Guillaume-Helbecque", "DistributedBag module: inter-locale work stealing mechanism", "2022-05-22T18:37:18Z"

Summary of Problem

Theoretically, when we declare a distributed bag, each locale has its own bag instance, and a dynamic inter-locale work stealing mechanism is supposed to occur when a locale's bag instance is empty.

Consider two locales (the same holds for any number of locales):
When locale 0 has a non-empty bag instance and locale 1 has an empty one, locale 1 is unable to steal work from locale 0. This happens both when locale 1 starts execution with an empty bag instance and when its bag instance becomes empty during execution.

We looked at DistributedBag.chpl on the main branch of chapel-lang/chapel and found that this behavior seems to come from an if condition that never evaluates to true:

if !targetBag!.loadBalanceInProgress.read() { ... }

This condition occurs in the REMOVE_WORST_CASE branch, where we attempt to steal work from another locale. The two other cases, REMOVE_BEST_CASE and REMOVE_AVERAGE_CASE, seem to work fine.

Steps to Reproduce

Source Code:

use DistributedBag;

config const n = 2000;

var bag = new DistBag(int, targetLocales=Locales); // creation of the bag

bag.addBulk(1..n); // insertion of elements

writeln("Initial bag size: ", bag.getSize(), "\n");

coforall loc in Locales do on loc do {
  var counter: atomic int = 0; // local counter

  coforall tid in 0..#here.maxTaskPar do {

    while true {
      var (hasElt, x): (bool, int) = bag.remove(); // we try to remove an element
      if hasElt { // true if we successfully removed an element
        counter.add(1); // update the counter
      }

      if (bag.getSize() == 0){ // if the global bag is empty
        break;
      }
    }

  } // end coforall tasks

  writeln(counter, " element(s) removed from ", here);
} // end coforall locales

Compile command:

chpl foo.chpl -o foo.o --fast

Execution command:

./foo.o -nl 2

Output:

Initial bag size: 2000

0 element(s) removed from LOCALE1
2000 element(s) removed from LOCALE0

Given the dynamic inter-locale work stealing mechanism, we expected the locales to remove approximately the same number of elements.
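
The second scenario from the summary, a bag instance that becomes empty during execution, can probably be exercised with a small variant of the program above. This is only a sketch: nSmall is a hypothetical config constant introduced for illustration, and the seeding step assumes that addBulk inserts into the bag instance of the locale on which it is called, as the module documentation describes for add. We have not recorded its output here.

```chapel
use DistributedBag;

config const n = 2000;     // elements seeded on locale 0
config const nSmall = 10;  // hypothetical: elements seeded on every other locale

var bag = new DistBag(int, targetLocales=Locales);

// Seed every locale's own bag instance with a deliberately unbalanced load
// (assumes addBulk targets the instance of the calling locale).
coforall loc in Locales do on loc {
  if here.id == 0 {
    bag.addBulk(1..n);
  } else {
    bag.addBulk(1..nSmall);
  }
}

writeln("Initial bag size: ", bag.getSize(), "\n");

coforall loc in Locales do on loc {
  var counter: atomic int; // per-locale count of removed elements

  coforall tid in 0..#here.maxTaskPar {
    while true {
      const (found, x) = bag.remove(); // found is true if an element was removed
      if found then counter.add(1);
      if bag.getSize() == 0 then break; // stop once the global bag is empty
    }
  }

  writeln(counter.read(), " element(s) removed from ", here);
}
```

If inter-locale stealing worked, the other locales would be expected to remove noticeably more than nSmall elements once their own instances run dry. A count stuck at nSmall would point at the same REMOVE_WORST_CASE path.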

Configuration Information

  • Output of chpl --version: 1.26.0
  • Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: ofi *
  CHPL_GASNET_SEGMENT: everything
CHPL_TASKS: qthreads
CHPL_LAUNCHER: gasnetrun_ofi
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: none *
CHPL_AUX_FILESYS: none
  • Back-end compiler and version: gcc (GCC) 6.4.0