New Issue: generalized iteration IDs

16405, “mppf”, “generalized iteration IDs”, “2020-09-17T19:07:27Z”

This issue proposes a mechanism whereby the locale, task, and vector lane position within a forall loop can be queried within that loop or within a follower iterator implementing that loop.

This issue builds upon #14405, CHIP 17, and CHIP 22.

Generally speaking, a standalone iterator might look like this:

iter myiter(... tag=standalone ...) {
  coforall loc in targetLocales {
    on loc {
      const numChunks = _computeNumChunks(...);
      coforall t in 1..numChunks {
        const part = computePartforChunk(...);
        for i in part {
          yield i;
        }
      }
    }
  }
}

This iterator has 3 portions:

  1. Divide work among locales (coforall loc in targetLocales / on loc)
  2. Divide work among tasks (coforall t in 1..numChunks)
  3. Indicate the work per task (for i in part)

We are supposing that the 3rd portion can also control vectorization (TODO link issue).
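For concreteness, here is a compilable sketch of a standalone iterator with those three portions, written over a plain integer range; the even-division chunking and all names here (lo, hi, perLoc, and so on) are purely illustrative rather than how any existing iterator is written:

iter myiter(lo: int, hi: int) {   // serial overload, needed for a forall
  for i in lo..hi do yield i;
}

iter myiter(lo: int, hi: int, param tag: iterKind)
    where tag == iterKind.standalone {
  coforall loc in Locales do on loc {        // portion 1: divide among locales
    const perLoc = (hi - lo + 1) / numLocales;
    const myLo = lo + loc.id * perLoc;
    const myHi = if loc.id == numLocales-1 then hi else myLo + perLoc - 1;
    const numChunks = here.maxTaskPar;
    coforall t in 0..#numChunks {            // portion 2: divide among tasks
      const perTask = (myHi - myLo + 1) / numChunks;
      const tLo = myLo + t * perTask;
      const tHi = if t == numChunks-1 then myHi else tLo + perTask - 1;
      for i in tLo..tHi do                   // portion 3: the per-task work
        yield i;
    }
  }
}

A forall i in myiter(1, n) would then invoke the standalone overload above.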

In order to enable a SIMT programming model, we would like code invoking such a forall to be able to create a per-vector-lane variable and to perform reductions across vector lanes. It is already possible to create a per-task variable:

forall i in myiter() with (var taskLocalVar=0) { ... }

Or to reduce across the tasks:

forall i in myiter() with (+ reduce x) { ... }
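For reference, here is a small compilable example combining both intents over a plain range standing in for myiter(); the names total and tmp are just for illustration:

config const n = 100;
var total = 0;
forall i in 1..n with (+ reduce total, var tmp = 0) {
  tmp += 1;     // task-private variable: each task has its own copy
  total += i;   // per-task partial sums are combined when the loop ends
}
writeln(total); // 5050 when n == 100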

A related issue (TODO: link) proposes that these mechanisms be extended to create per-vector-lane variables and per-task variables.

In order to support SIMT or SPMD programming, it is also necessary to allow one to get a task ID. #14405 describes one use case for that and CHIP 17 has other examples. In GPU programming this kind of functionality is needed to get the threads in a warp to combine efforts in order to move data around efficiently. For example, to perform a matrix transpose, one might want to copy a block of the matrix to nearby memory, then barrier, and then have the threads collectively store the transpose into the correct memory locations.

Specifically, this issue proposes that each forall loop create and track several numbers in task-local storage:

  • (perhaps) the current locale index and how many locales are iterated over
  • the current qthreads-style task and how many tasks were created
  • the current vector lane and how many vector lanes were created

In addition, forall loops can store details of the iteration space in task-local storage.
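To make that concrete, the proposed task-local information might look something like the record below; every name in it is hypothetical, since no such type or module exists today, and the Forall.getCurrentVectorLane()-style calls used later in this issue would presumably read fields like these out of task-local storage:

record forallIterationIds {
  var localeIdx: int;            // (perhaps) index of the current locale among those iterated over
  var numLocalesIterated: int;
  var taskIdx: int;              // current qthreads-style task on this locale
  var numTasksCreated: int;
  var laneIdx: int;              // current vector lane within the task
  var numLanesCreated: int;
}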

There are quite a few different possible constructions for parallel iterators. Tasks might be created in sublocales. Perhaps no locales are iterated over. Perhaps tasks are created during iteration.

Here is an iterator that creates tasks in a nested way:

  coforall t1 in chunk1 {
    coforall t2 in chunk2 {
      ... yield ...;
    }
  }

This one creates tasks for sublocales:

  coforall loc in targetLocales {
    on loc {
      coforall subloc in loc.getChildren() {
        on subloc {
          ... yield ...;
        }
      }
    }
  }

At this point we have 2 options:

  1. Create task IDs for each level of coforall in which tasks are created. This has the drawback of creating an arbitrary hierarchy in the task ID information that might be difficult for the author of a forall to predict. For example, when using a different locale model, an iterator might create more levels of tasks. The forall might expect something in particular for the type of the task ID information that does not apply in that configuration.
  2. Flatten the task IDs in some manner (a sketch of one possible flattening follows this list). This has the drawback of potentially limiting what the user can do in terms of working across different levels of the hierarchy. However, it at least makes it more feasible to write code that will work on other configurations. The question is, how many levels of IDs do we consider?
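As a sketch of option 2, the nested-coforall iterator above could flatten its two levels of tasks into a single (id, count) pair, say in row-major order; the 0-based ranges and names here are illustrative only:

const chunk1 = 4, chunk2 = 8;   // placeholder chunk counts
coforall t1 in 0..#chunk1 {
  coforall t2 in 0..#chunk2 {
    const flatTaskId = t1 * chunk2 + t2;   // 0 .. chunk1*chunk2 - 1
    const numFlatTasks = chunk1 * chunk2;
    // the iterator would record (flatTaskId, numFlatTasks) in task-local
    // storage before yielding, so the forall body sees a single flat ID
  }
}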

I’m pretty sure we need to know the current “Vector lane” or, in GPU terms, the current “thread within the block”.

Here is an example of a matrix transpose (from https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/ ).

// matrix transpose setup
config const n = 1024;  // matrix dimension (declared so the examples compile)
const Dom = {1..n, 1..n};
var Input: [Dom] int;
var Output: [Dom] int;
param t = 32; // tile size dimension
// naive matrix transpose
forall (i, j) in Dom {
  Output[j, i] = Input[i, j];
}

What if we want to copy the matrix to some scratch storage first in order to do a tiled transpose?

// tiled matrix transpose (assumes n is a multiple of t)
forall (i, j) in {1..n by t, 1..n by t} {
  var tile: [0..#t, 0..#t] int;
  tile = Input[i..#t, j..#t];
  // write the tile back transposed into the transposed block of Output
  for (a, b) in tile.domain do
    Output[j+a, i+b] = tile[b, a];
}

But that has the drawback that the bounds of the forall loop needed to be adjusted. Can we use an SPMD/SIMT idea to perform the tiled copy without adjusting the bounds of the forall? Each loop iteration can still be responsible for copying one element to the transpose destination, right? (Also, one might want a forall over the elements that does multiple things, like computing on the result of the transpose after it is done.)

// tiled matrix transpose
forall (i, j) in Dom {
  const (idX, idY) = Forall.getCurrentVectorLane();
  const (nX, nY) = Forall.getNumberOfVectorLanes();
  const tileStartX = i - idX;
  const tileStartY = j - idY;
  "GPU block shared" var tile:[0..#nX, 0..#nY] int; // notional, but somehow, declare one of these per GPU block
  // store into the tile; note that j represents the fastest
  // changing part of the address so we want those reads
  // to be in the least dimension
  tile[idX, idY] = Input[i,j];
  // barrier all vector lanes
  Forall.barrierVectorLanes();
  // store the transpose of the tile
  // but do so in a way where the fastest changing part of the
  // address is the fastest changing part of the loop
  // (so the second dimension of the access is determined by idY)
  // (reading tile[idY, idX] rather than tile[idX, idY] is what makes
  //  this a transpose of the tile; it assumes a square block, nX == nY)
  Output[tileStartY + idX, tileStartX + idY] = tile[idY, idX];
}

This example only used the last level of the IDs (for vector lanes / GPU block information). Could we do a similar thing for a distributed matrix transpose?

// tiled matrix transpose assuming Dom is distributed
forall (i, j) in Dom {
  const (idX, idY) = Forall.getTaskWithinLocale();
  const (nX, nY) = Forall.getNumberOfTasksWithinLocale();
  const tileStartX = i - idX;
  const tileStartY = j - idY;
  "locale shared" var tile:[0..#nX, 0..#nY] int; // notional, but somehow, declare one of these per locale
  // just one task within the locale does a bulk GET for the data,
  // instead of doing GETs for each element
  if (idX, idY) == (0,0) {
    // do a GET to read into the tile
    tile = Input[tileStartX..#nX, tileStartY..#nY];
    // build the transposed tile locally, then store it with one bulk PUT
    var tileT: [0..#nY, 0..#nX] int;
    for (a, b) in tile.domain do tileT[b, a] = tile[a, b];
    Output[tileStartY..#nY, tileStartX..#nX] = tileT;
  }
  }
  // barrier the tasks within the locale to make sure the GETs are done
  Forall.barrierTasksWithinLocale();
  // Compute further on the output
}

Does the iteration ID mechanism need to have a level for locales? Or are qthreads-style tasks and vector lanes sufficient? Can the tasks within a locale use locks or other shared data structures instead of the ID-based mechanism?
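Regarding that last question, here is a rough sketch of the shared-data-structure alternative, written as a plain coforall rather than a forall; the leader-election scheme and the use of the Barriers module are just one possible approach:

use Barriers;

const numTasks = here.maxTaskPar;
var ticket: atomic int;
var b = new Barrier(numTasks);

coforall tid in 0..#numTasks {
  // elect a leader: the first task to bump the counter does the bulk work
  if ticket.fetchAdd(1) == 0 {
    // leader: e.g. perform the bulk GET into a locale-shared tile
  }
  b.barrier();   // all tasks wait until the leader's work is complete
  // all tasks: compute on the shared data
}

The catch is that inside a forall the body does not know how many tasks the iterator created, nor does it have an obvious place to construct the barrier, which is part of what the IDs proposed here are meant to provide.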