[GPU] forall over large range doesn't enqueue enough GPU threads?

Since Chapel's GPU support doesn't yet handle 2D/3D kernels and I'm porting a 3D one, I've linearized my index space into a single forall loop over ~5.5 billion GPU threads / unique linear IDs. I then de-linearize each ID back into the equivalent of CUDA's threadIdx/blockIdx variables so that I can keep mapping one axis of the problem onto each dimension.

    // Total thread count: the product of the three grid*block extents (~5.5B).
    var workSize : int(64) = (isXBlock*isXGrid) * (isYBlock*isYGrid) * (isZBlock*isZGrid);
    forall linear_id in 0..<workSize {
      // Recover the 3D thread/block indices from the linear ID.
      var tid = get_ND_ID((isXGrid, isYGrid, isZGrid), (isXBlock, isYBlock, isZBlock), linear_id);
      ...
    }
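
For reference, here's a minimal sketch of what such a de-linearization helper might look like. The actual get_ND_ID isn't shown in this post, so the row-major, z-fastest ordering below is an assumption:

    // Hypothetical reconstruction -- the real get_ND_ID isn't shown above.
    // Assumes row-major order with z varying fastest.
    proc get_ND_ID(grid: 3*int, block: 3*int, linear_id: int(64)): 3*int(64) {
      const (gx, gy, gz) = grid;
      const (bx, by, bz) = block;
      const ny = gy * by,            // total extent along y
            nz = gz * bz;            // total extent along z
      // Keep everything in 64-bit arithmetic: with ~5.5B IDs, any 32-bit
      // intermediate here would overflow.
      const x = linear_id / (ny * nz),
            y = (linear_id / nz) % ny,
            z = linear_id % nz;
      return (x, y, z);
    }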

According to the CUDA SDK's deviceQuery, my card (an RTX 3090) can support far more than 5.5B threads in a single launch (the x grid dimension alone allows 2^31 - 1 blocks of up to 1024 threads each, over 2 trillion threads):

  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)

But according to Nsight Compute, I'm only actually enqueuing ~1.2B threads, which leaves part of the data unwritten and produces lots of NaNs in a subsequent kernel.
[Nsight Compute screenshot showing the ~1.2B-thread launch]

A CPU version of the same kernel runs the correct number of iterations and produces correct results.

Oddly, my next-largest problem size (in terms of the ID space, not work per thread) successfully enqueues more threads (~1.6B), enough to match its assigned problem size of 1.6B.

In terms of moving forward: where/how would I look at the generated intermediate CUDA code? And how does the compiler decide how many threads of the forall to turn into a grid/block configuration?

Ugh, sorry you bumped into this. The fix is Selectively use 64-bit ints for GPU kernel index computation by e-kayrakli · Pull Request #22259 · chapel-lang/chapel. We were using 32-bit ints in some critical paths that compute the loop-index-to-thread-index mapping, so you're running into overflow. The PR fixes that; I'm hoping to merge it later today, or by tomorrow at the latest.
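
The overflow also explains the exact count you observed. As a quick illustrative check (a sketch of the arithmetic, not the compiler's actual code path), reducing the trip count modulo 2^32 reproduces Nsight's number, while your 1.6B case fits in 32 bits and therefore launched correctly:

    // Illustrative only: assumes the 32-bit computation effectively
    // reduced the trip count modulo 2**32.
    const workSize = 5500000000;   // Chapel's default int is 64-bit
    writeln(workSize % 2**32);     // prints 1205032704, i.e. ~1.2B
    // 1600000000 < 2**31, so the next-largest problem didn't wrap.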

Are you using main or the release? If you can use main, that would be the way forward for you. The patch might also be applicable to 1.30, and I can create a backport if that helps.

No worries, I know this stuff is very new! I'm on the 1.30 tarball, but I'm totally willing to build from main whenever your PR is merged.

I naively tried looking at the --html --savec output and didn't learn much other than "something happens to the GPU code during the 'resolve' pass" :laughing: The PR is much more insightful.

The PR has been merged. Please let us know if there are other issues in your use case.

Yep, I get correct results now when using the main branch. Thanks!


Thanks for confirming (and reporting the issue), Paul, and big thanks for the super-quick fix, Engin!

-Brad
