Branch: refs/heads/main
Revision: 163824dd587cf31f375e78716e4c721d88719d5b
Author: e-kayrakli
Link: https://github.com/chapel-lang/chapel/pull/22259
Log Message:
Selectively use 64-bit ints for GPU kernel index computation (#22259)
This addresses a bug reported in "[GPU] forall over large range doesn't
enqueue enough GPU threads?".
It looks like we have been using `int`s in the runtime for `num_threads`,
and similarly `dtInt[INT_SIZE_32]` while generating index computation code
within GPU kernels. Both of these limited us to running GPU kernels on loops
with bounds that fit in 32-bit ints. This is an arbitrary limitation, as
GPUs can run more than 2**32 threads. This PR fixes that.
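The effect of a 32-bit limit can be sketched in plain C. This is only an illustration, not the code touched by the PR: the helper names (`index32`, `index64`, `num_blocks`, `truncate_to_32`) are hypothetical, and unsigned 32-bit arithmetic is used so that the wraparound is well defined in C, whereas the runtime used plain `int`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: global index computed entirely in 32-bit
 * arithmetic. The product wraps modulo 2**32 once it gets too large. */
static uint32_t index32(uint32_t block_idx, uint32_t block_dim,
                        uint32_t thread_idx) {
  return block_idx * block_dim + thread_idx;
}

/* Hypothetical helper: same computation, but each operand is widened
 * to 64 bits before multiplying, so the result does not wrap. */
static int64_t index64(uint32_t block_idx, uint32_t block_dim,
                       uint32_t thread_idx) {
  return (int64_t)block_idx * (int64_t)block_dim + (int64_t)thread_idx;
}

/* Hypothetical helper: number of blocks needed to cover n iterations,
 * rounded up, with the iteration count held in 64 bits. */
static int64_t num_blocks(int64_t n, int64_t block_size) {
  assert(n >= 0 && block_size > 0);
  return (n + block_size - 1) / block_size;
}

/* Hypothetical helper: what is left of a thread count after it has
 * been squeezed through a 32-bit variable (reduction modulo 2**32). */
static uint32_t truncate_to_32(int64_t n) {
  return (uint32_t)n;
}
```

With a block size of 512, block 8388608 (2**23) starts at index 2**32, so `index32(8388608, 512, 5)` wraps around to 5 while `index64` returns 4294967301. Likewise, a loop of 5,000,000,000 iterations needs 9,765,625 blocks, but its thread count would shrink to 705,032,704 if stored in 32 bits, which is consistent with the "not enough GPU threads" symptom in the bug report.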
While there, this PR also improves `--debugGpu` output in the following ways:
- adds grid dimensions to the output. Ideally, we should move this
  computation from the `chpl-gpu-impl` layer to the `chpl-gpu` layer and
  print the info out with `startVerboseGpu` output. But that's more than I
  am willing to do in this bug fix PR
- fixes an output that printed out a `size_t` with `%d` instead of `%zu`
- comments out the output for `chpl_gpu_memmove`, which generates a ton of
  output, making `debugGpu` useless.
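The `size_t` formatting fix in the list above can be illustrated with a minimal C sketch. The helper `format_size` and its message text are hypothetical and are not the runtime's actual debug output. `%zu` is the standard C99 conversion for `size_t`; passing a `size_t` to `%d` is undefined behavior and typically prints a wrong or negative number for large values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats a byte count into buf using %zu, the
 * conversion specifier that matches size_t. Returns the number of
 * characters snprintf would have written (excluding the terminator). */
static int format_size(char *buf, size_t buflen, size_t n) {
  assert(buf != NULL && buflen > 0);
  return snprintf(buf, buflen, "size: %zu bytes", n);
}
```

For example, `format_size(buf, sizeof buf, (size_t)3000000000u)` with a 64-byte buffer writes `size: 3000000000 bytes`, a value that does not fit in a signed 32-bit `int`.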
[Reviewed by @DanilaFe]
Test:
- gpu/native with NVIDIA
- gpu/native with AMD
Diff:
M compiler/optimizations/gpuTransforms.cpp
M runtime/include/chpl-gpu-impl.h
M runtime/include/chpl-gpu.h
M runtime/src/chpl-gpu.c
M runtime/src/gpu/cuda/gpu-cuda.c
M runtime/src/gpu/rocm/gpu-rocm.c
A test/gpu/native/largeLoop.chpl
A test/gpu/native/largeLoop.good
https://github.com/chapel-lang/chapel/pull/22259.diff