New Issue: Support multidimensional kernel launches

22152, "e-kayrakli", "Support multidimensional kernel launches", "2023-04-25T18:10:08Z"

Background

This feature request has come up in different contexts.

1. Matrix transpose

For team members at HPE, this was raised in https://github.com/Cray/chapel-private/issues/4288

@DanilaFe requested this, writing:

The CUDA transpose benchmark, as well as other transpose implementations we've seen, uses a two-dimensional kernel launch to implement the read-write coalesced version. This is because it is desirable to make reads from the input array and writes to the output array occur row-wise (i.e., not strided). To this end, the coalesced implementation creates a 16x16 memory segment in block shared memory, reads into it row-wise (writing columns) and writes from it row-wise (reading rows). The two-dimensional block size is useful for creating and populating this square piece of shared memory.

Currently, implementing such an algorithm in Chapel requires using a #pragma "codegen for gpu" proc and the multi-dimensional kernel launch primitive. That is to say, there is no user-facing way to launch a multi-dimensional kernel. However, judging from this transpose benchmark, it seems like we might need to consider ways for users to write code that results in multi-dimensional blocks. We want it to be possible to write a performant transpose using idiomatic Chapel.

There are some open questions, including the behavior of multi-dimensional zippering, the precise sort of language features that might be necessary for finer-grained control over block shapes, and even how loops over two-dimensional domains should be GPUized.
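
For concreteness, here is a minimal sketch of the kind of idiomatic loop we would want to result in a two-dimensional kernel launch; the array names and size are placeholders, not taken from the benchmark:

config const n = 2048;  // placeholder matrix dimension

on here.gpus[0] {
  var A, B: [1..n, 1..n] real;
  // Ideally, this loop over a 2D domain would launch as a 2D kernel,
  // leaving room for a tiled, shared-memory transpose underneath.
  forall (i, j) in {1..n, 1..n} do
    B[j, i] = A[i, j];
}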

@mppf's thoughts were:

My understanding is that other languages / efforts with GPUs have run into the problem that the desired traversal order on GPU to achieve the best memory coalescing is different from the best traversal order on the CPU to achieve the best cache usage. This might apply to row-major vs. column-major but at least I have heard of it applying to array-of-structures vs structure-of-arrays. It is unclear to me to what degree we have this problem, but when thinking about multidimensional loops and zippering, I think a key thing to consider is what the most natural memory access pattern will be & whether or not we can make that efficient on both the GPU and the CPU.

2. @psath alluded to this need

See Mixing atomic add with non-atomic read? (and towards GPU atomics)

Potential Idioms

Standalone functions now, new features later

How can we support this? I am fairly strongly in favor of using something similar to setBlockSize today:

on here.gpus[0] {
  foreach i in 1..n {
    setGridSize(x,y,z);  // would make the grid 3 dimensional
    setBlockSize(a,b,c);  // would make the block 3 dimensional
  }
}

setBlockSize is a temporary solution, though. I expect that we will replace these with more portable solutions via something like forall task IDs (https://github.com/chapel-lang/chapel/issues/16405). With that, a potential snippet might look like:

on here.gpus[0] {
  foreach i in 1..n with (config cfg = new LoopContext(gridSize=(x,y,z), blockSize=(a,b,c))) {

  }
}

Note that this snippet also relies on foreach intents, which are not supported yet but have been requested.

Nested loops

An alternative is to compile nested order-independent loops into multidimensional kernels:

on here.gpus[0] {
  forall i in gridDimX {
    forall j in gridDimY {
      forall k in gridDimZ {
      }
    }
  }
}

Implementation challenges aside, I think this nested parallelism is more akin to a kernel launching another kernel (a.k.a. dynamic parallelism) than to multidimensional kernels. I'll create an issue for dynamic parallelism, too.
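
By contrast, what I mean by a multidimensional kernel is a single loop over a multidimensional index space, sketched below with the same placeholder ranges as the snippet above:

on here.gpus[0] {
  // One kernel over a 3D index space, rather than kernels launching kernels.
  foreach (i, j, k) in {gridDimX, gridDimY, gridDimZ} {
    // kernel body
  }
}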

Implementation Details

I'll ramble a bit to capture my thoughts here:

Implementing this would involve translating the loop indices into thread indices. That may be easy for 1D loops, but it is more challenging for multidimensional loops, which I expect to be the more common case in scenarios where you'd want a multidimensional kernel. Today, for multidimensional loops, the outermost dimension "becomes" the kernel, and the inner dimensions are regular for loops within it. First, we'll need to change that, which is probably the main challenge, but I don't think it is an obstacle. Second, we'll need to translate the thread indices into loop indices somehow.

There are also questions around whether you can have, for example, a 3D loop with a 2D kernel launch. If we were to support that, we'd need to make the outermost two dimensions the kernel and keep the innermost dimension as a for loop within it. And maybe, as an initial step, we can prevent the number of ranks of the kernel from being greater than the number of ranks of the loop? @stonea, I believe you implemented the support for multidimensional loops that we have today. What are your thoughts here?
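
To make the index translation concrete, here is a rough sketch of a 3D loop mapped onto a 2D kernel launch. The names and sizes are placeholders, and the lowering shown in the comments is conceptual, not actual compiler-generated code:

config const nx = 64, ny = 64, nz = 64;  // placeholder loop bounds

on here.gpus[0] {
  var A: [1..nx, 1..ny, 1..nz] real;
  foreach (i, j, k) in {1..nx, 1..ny, 1..nz} do
    A[i, j, k] = i + j + k;

  // Conceptually, with a 2D kernel launch each GPU thread would compute:
  //   i = 1 + (global thread index in the x dimension)
  //   j = 1 + (global thread index in the y dimension)
  //   if i <= nx && j <= ny then
  //     for k in 1..nz do      // innermost dimension stays a serial for loop
  //       A[i, j, k] = i + j + k;
}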