I am interested in learning more about how the GPU code generation works. More specifically, I would like to know how you are tackling the problem of feature discrepancies between the vendors. Do you plan to implement a standard set of GPU functionality for all vendors?
Welcome to Chapel's discourse!
I am interested in learning more about how the GPU code generation works.
This is a more high-level answer than your specific question asks for, but let me describe the overall picture first. For code generation specifically, we generate LLVM IR and rely on LLVM/clang to generate the vendor-specific low-level code/assembly (PTX or GCN). These are then compiled with vendor-specific tools to generate the device binary.
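To make that concrete, here is a minimal sketch of the kind of Chapel code that goes through this pipeline when the compiler is built with the GPU locale model (`CHPL_LOCALE_MODEL=gpu`); the order-independent `foreach` loop inside the `on` block is what gets lowered to LLVM IR and then to PTX or GCN:

```chapel
// Minimal GPU-eligible Chapel code (assumes CHPL_LOCALE_MODEL=gpu at build time).
on here.gpus[0] {            // execute on the first GPU sublocale
  var A: [1..10] int;        // allocated in device-accessible memory
  foreach i in 1..10 do      // order-independent loop: candidate for GPU kernel codegen
    A[i] = i * i;
  writeln(A);
}
```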
Code generation and compilation in general only capture part of what you are asking, though. We also have a lot of module code, mostly in the `GPU` module that you have to `import` yourself, and to some extent in an internal module whose interface is available to you without doing anything. These modules typically end up calling runtime functions as externs to use devices. Typical examples we have today are atomic operations in the `gpuAtomic*` functions and reduce operations in the `gpu*Reduce` functions (both interfaces are subject to change in order to make them mesh well with Chapel's `atomic` types and `reduce` expressions). Our runtime is modular w.r.t. the GPU vendor you use. So, the high-level runtime interface is vendor-neutral, but depending on which vendor you build the runtime for, we end up calling a vendor-specific low-level runtime interface under the hood. All of this is hidden from the modules.
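As a rough illustration of those module-level utilities (again, these interfaces are subject to change), here is a sketch using `gpuAtomicAdd` and `gpuSumReduce` from the user-facing `GPU` module; both ultimately call vendor-neutral runtime externs:

```chapel
// Sketch of the GPU module's atomic and reduce helpers.
use GPU;

on here.gpus[0] {
  var A: [1..1000] int = 1;
  var total: [0..0] int;           // accumulator in device-accessible memory

  foreach i in A.domain do
    gpuAtomicAdd(total[0], A[i]);  // device atomic; lowered to a runtime extern

  writeln(total[0]);               // sum of A, computed via atomics
  writeln(gpuSumReduce(A));        // same sum, via the vendor-neutral reduce helper
}
```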
Slides 28 and 29 here have some diagrams describing compilation and the runtime structure that might also help: https://chapel-lang.org/presentations/EnginTechTalk2024-static-public.pdf
You might also want to check the GPU module documentation: GPU — Chapel Documentation 1.34
More specifically I would like to know more about how you guys are tackling the problem of feature discrepancies between the vendors.
We have run into some mismatches between the vendors, but I'd like to think they are fairly corner-case. One that I can find on a quick look is that `gpuAtomicMax` doesn't work with 64-bit integrals (`uint`) because the ROCm version we were working with didn't have support for that operation. Things may have already improved in ROCm 6, but we haven't gotten the chance to look into it yet.
We've also run into some differences w.r.t. peer-to-peer data transfer capabilities, but those are a bit more lightweight: GPU Programming — Chapel Documentation 1.34
Do you have something else in mind? I imagine there must be other discrepancies between the two vendors (and Intel, down the road for us) that we haven't brushed up against, yet.
Do you plan to implement a standard set of GPU functionality for all vendors?
That's our ultimate goal. At a high-level there are two strategies for cases where there's a discrepancy:
Make it work through other means, no matter how slow it might get: If vendor X doesn't support `foo` but vendor Y does, we would strive to have the corresponding feature/codegen capability so that `foo` can be used with Y. For X, we should look for ways of implementing the same functionality ourselves to keep a fully portable interface. The atomic example above obviously doesn't do that. We'll always have to make priority decisions for cases like that: "Is it worth spending potentially a lot of effort on a fallback, or should we just error out?"
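For a sense of what such a fallback could look like, here is a hypothetical sketch (not actual Chapel module code): it uses the `CHPL_GPU` param from `ChplConfig` to pick between a native atomic and a slower, portable emulation. The helper name `portableAtomicMax` and the exact `gpuAtomicCAS` return-value behavior are assumptions for illustration:

```chapel
// Hypothetical sketch of a portability fallback for a vendor gap.
use GPU, ChplConfig;

proc portableAtomicMax(ref x: uint, val: uint) {
  if CHPL_GPU == "nvidia" {
    gpuAtomicMax(x, val);          // supported natively on this vendor
  } else {
    // Portable (slower) emulation via compare-and-swap, assuming
    // gpuAtomicCAS returns the previous value, as CUDA's atomicCAS does.
    var old = x;
    while old < val {
      const prev = gpuAtomicCAS(x, old, val);
      if prev == old then break;   // our swap won; max is installed
      old = prev;                  // someone else updated x; retry
    }
  }
}
```

Because `CHPL_GPU` is a `param`, the branch folds away at compile time, so the non-native path costs nothing on vendors that support the operation directly.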
Rely on multi-resolution design principles: Chapel's multi-resolution design means that there should be low-level means for users to leverage hardware capabilities. So, the user can choose between writing high-level, easy-to-maintain, portable code and low-level, easy-to-control, potentially-less-portable code. Building on that, we have sometimes discussed having submodules inside the `GPU` module, so that you could `import GPU.NVIDIA as nvidia` and then do things like `nvidia.doThis`. In this scenario, the call would not be portable, but it would give you the `doThis` functionality that one vendor has and the other doesn't.
With all that, I must also acknowledge that we don't support Intel GPUs today. As of today, supporting Intel GPUs is a bit hard to prioritize in the grand scheme of things, but it may change of course. My answers are mostly based on our experiences with NVIDIA and AMD so far.
Hope this helps.
Thank you for the reply; your answer is definitely helpful! I am interested in heterogeneous programming standards and have been thinking about how to deal with the discrepancy problem. From what I can gather:
Make it work through other means, no matter how slow it might get means putting portability first, and
Rely on multi-resolution design principles puts performance first. Would you agree with that description?
I certainly hope that Chapel is a good solution to this problem
Make it work through other means, no matter how slow it might get means putting portability first, and
Rely on multi-resolution design principles puts performance first. Would you agree with that description?
That sounds right. I also wanted to clarify that these are not mutually exclusive. We can have both solutions in place for a given discrepancy. If the feature in question is not in a time-critical portion of the code, you can use our high-level portability fallback and forget about it completely. If the feature is used in a time-critical portion, then you can spend the effort to use the non-portable, low-level solution.
This is just my mental framework for tackling challenges in the design. What I'd like us to do for a specific discrepancy would depend on what it looks like and how serious it is, really.