New Issue: Support multidimensional kernel launches

22152, "e-kayrakli", "Support multidimensional kernel launches", "2023-04-25T18:10:08Z"

Background

This feature request has come up in different contexts.

1. Matrix transpose

For team members at HPE, this was raised in https://github.com/Cray/chapel-private/issues/4288

@DanilaFe requested this, writing:

The CUDA transpose benchmark, as well as other transpose implementations we've seen, uses a two-dimensional kernel launch to implement the read-write coalesced version. This is because it is desirable to make reads from the input array and writes to the output array occur row-wise (i.e., not strided). To this end, the coalesced implementation creates a 16x16 memory segment in block shared memory, reads into it row-wise (writing columns) and writes from it row-wise (reading rows). The two-dimensional block size is useful for creating and populating this square piece of shared memory.

Currently, implementing such an algorithm in Chapel requires using a #pragma "codegen for gpu" proc and the multi-dimensional kernel launch primitive. That is to say, there is no user-facing way to launch a multi-dimensional kernel. However, judging from this transpose benchmark, it seems like we might need to consider ways for users to write code that results in multi-dimensional blocks. We want it to be possible to write a performant transpose using idiomatic Chapel.

There are some open questions, including the behavior of multi-dimensional zippering, the precise sort of language features that might be necessary for finer-grained control over block shapes, and even how loops over two-dimensional domains should be GPUized.
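
For concreteness, here is a minimal sketch of the kind of idiomatic loop we would want to result in a two-dimensional kernel launch; the array names and size are placeholders, not taken from the benchmark:

config const n = 2048;  // placeholder matrix dimension

on here.gpus[0] {
  var A, B: [1..n, 1..n] real;
  // Ideally, this loop over a 2D domain would launch as a 2D kernel,
  // leaving room for a tiled, shared-memory transpose underneath.
  forall (i, j) in {1..n, 1..n} do
    B[j, i] = A[i, j];
}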

@mppf's thoughts were:

My understanding is that other languages / efforts with GPUs have run into the problem that the desired traversal order on GPU to achieve the best memory coalescing is different from the best traversal order on the CPU to achieve the best cache usage. This might apply to row-major vs. column-major but at least I have heard of it applying to array-of-structures vs structure-of-arrays. It is unclear to me to what degree we have this problem, but when thinking about multidimensional loops and zippering, I think a key thing to consider is what the most natural memory access pattern will be & whether or not we can make that efficient on both the GPU and the CPU.

2. @psath alluded to this need

See Mixing atomic add with non-atomic read? (and towards GPU atomics)

Potential Idioms

Standalone functions now, new features later

How can we support this? I am fairly strongly in favor of using something similar to setBlockSize today:

on here.gpus[0] {
  foreach i in 1..n {
    setGridSize(x,y,z);  // would make the grid 3 dimensional
    setBlockSize(a,b,c);  // would make the block 3 dimensional
  }
}

setBlockSize is a temporary solution, though. I expect that we will replace these with more portable solutions via something like forall task IDs (https://github.com/chapel-lang/chapel/issues/16405). With that, a potential snippet might look like:

on here.gpus[0] {
  foreach i in 1..n with (config cfg = new LoopContext(gridSize=(x,y,z), blockSize=(a,b,c))) {

  }
}

Note that this snippet also relies on foreach intents, which are not supported yet but have been requested.

Nested loops

An alternative is to compile nested order-independent loops into multidimensional kernels:

on here.gpus[0] {
  forall i in gridDimX {
    forall j in gridDimY {
      forall k in gridDimZ {
      }
    }
  }
}

Implementation challenges aside, I think this nested parallelism is more akin to a kernel launching another kernel (a.k.a. dynamic parallelism) than to multidimensional kernels. I'll create an issue for dynamic parallelism, too.
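
By contrast, what I mean by a multidimensional kernel is a single loop over a multidimensional index space, sketched below with the same placeholder ranges as the snippet above:

on here.gpus[0] {
  // One kernel over a 3D index space, rather than kernels launching kernels.
  foreach (i, j, k) in {gridDimX, gridDimY, gridDimZ} {
    // kernel body
  }
}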

Implementation Details

I'll ramble a bit to capture my thoughts here:

Implementing this would involve translating the loop indices into thread indices. That may be easy for 1D loops, but it is more challenging for multidimensional loops, which I expect to be the more common case in scenarios where you'd want a multidimensional kernel. Today, for multidimensional loops, the outermost dimension "becomes" the kernel, and the inner dimensions are regular for loops within it. First, we'll need to change that, which is probably the main challenge, but I don't think it is an obstacle. Second, we'll need to translate the thread indices into loop indices somehow.

There are also questions around whether you can have, for example, a 3D loop with a 2D kernel launch. If we were to support that, we'd need to make the outermost two dimensions the kernel and keep the innermost dimension as a for loop within it. And maybe, as an initial step, we can prevent the number of ranks of the kernel from being greater than the number of ranks of the loop? @stonea, I believe you implemented the support for multidimensional loops that we have today. What are your thoughts here?
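
To make the index translation concrete, here is a rough sketch of a 3D loop mapped onto a 2D kernel launch. The names and sizes are placeholders, and the lowering shown in the comments is conceptual, not actual compiler-generated code:

config const nx = 64, ny = 64, nz = 64;  // placeholder loop bounds

on here.gpus[0] {
  var A: [1..nx, 1..ny, 1..nz] real;
  foreach (i, j, k) in {1..nx, 1..ny, 1..nz} do
    A[i, j, k] = i + j + k;

  // Conceptually, with a 2D kernel launch each GPU thread would compute:
  //   i = 1 + (global thread index in the x dimension)
  //   j = 1 + (global thread index in the y dimension)
  //   if i <= nx && j <= ny then
  //     for k in 1..nz do      // innermost dimension stays a serial for loop
  //       A[i, j, k] = i + j + k;
}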