[Chapel Merge] Selectively use 64-bit ints for GPU kernel index computation

Branch: refs/heads/main
Revision: 163824dd587cf31f375e78716e4c721d88719d5b
Author: e-kayrakli
Link: Selectively use 64-bit ints for GPU kernel index computation by e-kayrakli · Pull Request #22259 · chapel-lang/chapel · GitHub
Log Message:
Selectively use 64-bit ints for GPU kernel index computation (#22259)

This addresses a bug reported in
[GPU] forall over large range doesn't enqueue enough GPU threads?.

It looks like we had been using ints in the runtime for
num_threads, and similarly dtInt[INT_SIZE_32] when generating index
computation code within GPU kernels. Both of these limited us to running
GPU kernels only on loops whose bounds fit in a 32-bit int. This is an
arbitrary limitation, as GPUs can run more than 2**32 threads. This PR
fixes that.

While there, this PR also improves --debugGpu output in the following ways:

  • adds grid dimensions to the output. Ideally, we should move this
    computation from the chpl-gpu-impl layer to the chpl-gpu layer and
    print the info out with the startVerboseGpu output, but that's more
    than I am willing to do in this bug-fix PR
  • fixes an output that printed a size_t with %d instead of %zu
  • comments out the output for chpl_gpu_memmove, which generates so much
    output that it makes --debugGpu useless.

[Reviewed by @DanilaFe]

Test:

  • gpu/native with NVIDIA
  • gpu/native with AMD

Compare: Comparing b84bf8b31a1b4145d4231b99ce532201864026cb...ec809edaa545ab641fffd7788a04c724b93ced71 · chapel-lang/chapel · GitHub

Diff:
M compiler/optimizations/gpuTransforms.cpp
M runtime/include/chpl-gpu-impl.h
M runtime/include/chpl-gpu.h
M runtime/src/chpl-gpu.c
M runtime/src/gpu/cuda/gpu-cuda.c
M runtime/src/gpu/rocm/gpu-rocm.c
A test/gpu/native/largeLoop.chpl
A test/gpu/native/largeLoop.good
https://github.com/chapel-lang/chapel/pull/22259.diff