[Chapel Merge] Add support for `gpu*Reduce` functions on AMD GPUs

Branch: refs/heads/main
Revision: e547028a5e9088b012e95c4fc631aa8590c1a711
Author: e-kayrakli
Link: https://github.com/chapel-lang/chapel/pull/23950
Log Message:
Add support for gpu*Reduce functions on AMD GPUs (#23950)

Resolves https://github.com/Cray/chapel-private/issues/5609

Continuation from where I left off in
https://github.com/chapel-lang/chapel/pull/23689. In that PR, I
struggled with segfaults on AMD GPUs, so I had to back out of AMD
support. It turns out AMD GPUs tend to segfault at execution time if you
don't use the right --offload-arch, and the default on the system I
tested on was not right. This PR passes the right --offload-arch while
compiling the reduction support code in the runtime, which removes the
blocker.

Details

  • The runtime now has a chpl_gpu_can_reduce/chpl_gpu_impl_can_reduce
    interface that returns true/false depending on whether we can use the
    cub-based reduction support
  • Today, it returns false for cpu-as-device mode and for ROCm 4.x, which
    doesn't have hipcub
  • For cases where there's no cub-based reduction, we fall back to regular
    CPU-based reductions. On ROCm 4.x this means we copy the array to the
    host and reduce on the host. Clearly, this is less than ideal and just a
    portability stopgap. I hope to drop ROCm 4 support as soon as we can
  • Adds a new rocm-utils header to be able to use ROCM_VERSION_MAJOR
    portably, and to be able to use ROCM_CALL in multiple files
  • Moves test/gpu/native/noAmd/reduction directory to test/gpu/native
    and removes noAmd.skipif
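
The capability check and fallback described above can be sketched roughly
as follows. This is an illustration only, not the runtime code from this
PR: every sketch_* name, the backend enum, and the "ROCm major version
greater than 4" test are assumptions made for the example, and the
"device" path is a host loop standing in for a cub/hipcub call so the
sketch builds without a GPU toolchain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backend tags; the real runtime picks its backend at build time. */
typedef enum { SKETCH_NVIDIA, SKETCH_AMD, SKETCH_CPU_AS_DEVICE } sketch_backend_t;

/* Plays the role described for chpl_gpu_impl_can_reduce: report whether a
   cub-based reduction is usable. rocm_major only matters for SKETCH_AMD. */
static bool sketch_impl_can_reduce(sketch_backend_t backend, int rocm_major) {
  switch (backend) {
    case SKETCH_NVIDIA:        return true;
    case SKETCH_AMD:           return rocm_major > 4; /* assumption: ROCm 4.x lacks hipcub */
    case SKETCH_CPU_AS_DEVICE: return false;
  }
  return false;
}

/* Stand-in for a device-side (cub/hipcub) sum; a plain loop in this sketch. */
static long sketch_device_sum(const long *data, size_t n) {
  long acc = 0;
  for (size_t i = 0; i < n; i++) acc += data[i];
  return acc;
}

/* Host fallback: in the real runtime the array would first be copied to the
   host; here the data is already host memory, so we only reduce it. */
static long sketch_host_sum(const long *data, size_t n) {
  long acc = 0;
  for (size_t i = 0; i < n; i++) acc += data[i];
  return acc;
}

/* Dispatch: use the device path when available, otherwise fall back to the
   host. used_device reports which path was taken. */
static long sketch_sum_reduce(sketch_backend_t backend, int rocm_major,
                              const long *data, size_t n, bool *used_device) {
  assert(data != NULL || n == 0);
  assert(used_device != NULL);
  if (sketch_impl_can_reduce(backend, rocm_major)) {
    *used_device = true;
    return sketch_device_sum(data, n);
  }
  *used_device = false;
  return sketch_host_sum(data, n);
}
```

Calling sketch_sum_reduce with SKETCH_AMD and a ROCm major version of 4
takes the host path, while a major version of 5 takes the device path;
both return the same sum, which is the point of the stopgap.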

[Reviewed by @stonea]

Test

  • nvidia
  • amd with ROCm 4.2
  • amd with ROCm 4.4
  • amd with ROCm 5.2 gpu/native/reduction only
  • amd with ROCm 5.4 gpu/native/reduction only
  • cpu gpu/native/reduction only

Compare: https://github.com/chapel-lang/chapel/compare/36d69e2eb49e872ca4a95bf1ea8ae10c3e7ccfea...7cbd647d78c16fc4aa5d149b5aac2594fa5d690f

Diff:
M modules/standard/GPU.chpl
M runtime/include/chpl-gpu-impl.h
M runtime/include/chpl-gpu.h
M runtime/src/chpl-gpu.c
M runtime/src/gpu/amd/Makefile.share
M runtime/src/gpu/amd/gpu-amd-reduce.cc
M runtime/src/gpu/amd/gpu-amd.c
A runtime/src/gpu/common/rocm-utils.h
M runtime/src/gpu/cpu/gpu-cpu.c
M runtime/src/gpu/nvidia/gpu-nvidia.c
D test/gpu/native/noAmd.skipif
D test/gpu/native/noAmd/reduction/largeArrays.execopts
R100 test/gpu/native/noAmd/reduction/basic.chpl test/gpu/native/reduction/basic.chpl
R100 test/gpu/native/noAmd/reduction/basic.good test/gpu/native/reduction/basic.good
R057 test/gpu/native/noAmd/reduction/largeArrays.chpl test/gpu/native/reduction/largeArrays.chpl
A test/gpu/native/reduction/largeArrays.execopts
R100 test/gpu/native/noAmd/reduction/largeArrays.good test/gpu/native/reduction/largeArrays.good
R100 test/gpu/native/noAmd/reduction/largeArrays.skipif test/gpu/native/reduction/largeArrays.skipif
R067 test/gpu/native/noAmd/reduction/largeArraysMinMax.chpl test/gpu/native/reduction/largeArraysMinMax.chpl
R100 test/gpu/native/noAmd/reduction/largeArraysMinMax.compopts test/gpu/native/reduction/largeArraysMinMax.compopts
R100 test/gpu/native/noAmd/reduction/largeArraysMinMax.execopts test/gpu/native/reduction/largeArraysMinMax.execopts
R100 test/gpu/native/noAmd/reduction/largeArraysMinMax.good test/gpu/native/reduction/largeArraysMinMax.good
R100 test/gpu/native/noAmd/reduction/largeArraysMinMax.skipif test/gpu/native/reduction/largeArraysMinMax.skipif
R100 test/gpu/native/noAmd/reduction/nonZeroBased.chpl test/gpu/native/reduction/nonZeroBased.chpl
R100 test/gpu/native/noAmd/reduction/nonZeroBased.good test/gpu/native/reduction/nonZeroBased.good
R100 test/gpu/native/noAmd/reduction/reduceThroughput.chpl test/gpu/native/reduction/reduceThroughput.chpl
R100 test/gpu/native/noAmd/reduction/reduceThroughput.execopts test/gpu/native/reduction/reduceThroughput.execopts
R100 test/gpu/native/noAmd/reduction/reduceThroughput.good test/gpu/native/reduction/reduceThroughput.good
R100 test/gpu/native/noAmd/reduction/stringError.chpl test/gpu/native/reduction/stringError.chpl
R100 test/gpu/native/noAmd/reduction/stringError.good test/gpu/native/reduction/stringError.good
R100 test/gpu/native/noAmd/reduction/stringError.prediff test/gpu/native/reduction/stringError.prediff
https://github.com/chapel-lang/chapel/pull/23950.diff