28444, "jabraham17", "[Bug]: Interop in AMD GPU kernels with newer ROCm versions does not work well", "2026-02-23T21:57:35Z"
With newer ROCm versions, we are finding that doing various forms of interop in kernels generated from Chapel causes crashes.
As outlined in chapel-lang/chapel PR #28423 ("Initial support for ROCm 7"), various tests fail when they try to do interop. "Interop" can take several forms:
- calling gpuWrite, which relies on printf in a kernel
- halting, which relies on printf and has other issues (see chapel-lang/chapel issue #28415, "Better halting error message with ROCm 6.3")
- using an extern proc defined with `__device__ __host__`
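For readers unfamiliar with the last bullet, here is a rough sketch of what the C side of such an extern proc might look like. The header name and both function names are hypothetical and are not taken from the Chapel test suite; the macro guard is only there so the snippet also builds with a plain host compiler. This is not a reproducer, just an illustration of the shape of the code involved.

```cpp
// demo_interop.h: hypothetical header for illustration only.
#include <cstdio>

// With a plain host compiler (no HIP/CUDA front end) the qualifiers are
// not defined, so make them no-ops to keep the header buildable.
#if !defined(__HIPCC__) && !defined(__CUDACC__)
#define __device__
#define __host__
#endif

// Callable from both host code and GPU kernels; no printf involved.
__device__ __host__ static inline int demo_add_one(int x) {
  return x + 1;
}

// Same qualifiers, but the body calls printf, which is the pattern the
// failing tests appear to have in common.
__device__ __host__ static inline int demo_add_one_verbose(int x) {
  printf("demo_add_one_verbose(%d)\n", x);
  return x + 1;
}
```

The two variants are split on purpose: if only the printf variant misbehaves inside a kernel, that would point at printf rather than at extern procs in general.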
The common theme is printing functions like printf in a kernel. Sometimes normal interop works OK (like calling external kernels), but for the most part the issue is with printf. It's not clear whether this is because of IO, the special-ish nature of printf in C (varargs), or simply because printf happens to be the most prevalent form of interop in our test systems.
We see the halting issue with ROCm 6.3, but the general printf problems are with ROCm 7.
We had previously seen similar issues when upgrading LLVM versions, and the problem could lie there as well. The fix we used then was to force Chapel compilation to use the AMD LLVM instead of upstream (https://github.com/Cray/chapel-private/issues/5469 has lots of details for those with access). However, we stopped using the AMD LLVM for several reasons:
- AMD stopped shipping the full LLVM library in their binary builds of ROCm/hip/clang
- AMD doesn't fork from upstream LLVM consistently, which breaks our LLVM compatibility code (i.e. llvm-config reports version A, but the code for a specific API is from version B)
We switched back to upstream LLVM for ROCm 6 because things seemed to be working again. With ROCm 7, they are broken again. I haven't proved this is the same or a related issue, but it seems likely.