Hello @xl-tian! Welcome to our Discourse!
I'll try to answer everything, but in a bit of a different order than you asked:
Does this mean that Chapel developers are working towards features that allow us to declare a distributed array over multiple GPU locales, the same way we could declare distributed array on CPU locales?
That's exactly the end goal here. Something like the following is an example of how that could look:
use BlockDist;
// an array distributed across all local GPUs
var Arr = blockDist.createArray(1..n, int, targetLocales=here.gpus);
// this will run as kernels on each GPU that Arr is distributed on
forall elem in Arr do
  elem = compute();
But I also want to make a distinction between two separate tasks:
- distributed array support as I outlined above
- GPU-driven communication as you cited above
While they look related, and I am sure there will be use cases that need both, they are separate implementation-wise. The second bullet refers to:
// A remote array that sits on a different compute node
on Locales[1] var RemoteCpuArr: [1..n] int;
on here.gpus[0] {
  forall i in 1..n {  // this will be a kernel
    RemoteCpuArr[i] = i;
  }
}
The access to RemoteCpuArr inside the forall loop is perfectly legal Chapel. But right now it doesn't work because of the lack of GPU-driven communication. The example is a bit contrived, admittedly. A more common case is probably GPUs sitting across the network communicating with each other from inside kernels.
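To make that more concrete, here is a minimal sketch of what such a case could look like once GPU-driven communication is supported. The array names are mine, and, to be clear, this does not work today:
// a GPU-resident array on a remote node's first GPU
on Locales[1].gpus[0] var RemoteGpuArr: [1..n] int;
on here.gpus[0] {
  forall i in 1..n {  // this will be a kernel on the local GPU
    RemoteGpuArr[i] = i;  // GPU-driven access across the network (not supported yet)
  }
}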
I also want to take a quick sidebar on "from inside kernels". The data allocated on a GPU can be communicated across the network today:
on here.gpus[0] var GpuArr: [1..n] int;
on Locales[1].gpus[5] {
  var MyArr = GpuArr;  // data moving from one GPU to another across the network
}
The snippet above should work today; the difference is that the communication is initiated by the CPU, not the GPU. In other words, what matters is which processor initiates the communication, not where the data is. That being said, the current implementation of this kind of communication is a bit inefficient because the data is staged through host memory.
Note that I also covered several of these things, especially the last point, in a recent demo that was recorded. You might want to check that out since you are interested in the internals of Chapel's GPU support: https://www.youtube.com/watch?v=J0av4VJbS4o&list=PLuqM5RJ2KYFhNSlQFpOe9Sz8sftdsuAxO&index=3&ab_channel=ChapelParallelProgrammingLanguage
I am trying to find if there is a good research question on PGAS GPU programming in Chapel. Thanks!!
I believe there are plenty. Let me go over some of the things I touched upon here:
- distributed array support: I really want to see this in Chapel, though the work is more engineering than research at this point. Once that engineering effort is done, we can regroup and consider potential research directions.
- GPU-driven communication: This is probably the most exciting research direction. There are many questions that need to be answered for an efficient implementation, while a relatively rudimentary implementation to establish the correctness of the feature doesn't require a ton of engineering.
- Inefficient GPU data movement: This is also mostly an engineering effort, but it could be interesting and publishable work if you want to pursue it.
If so, what would syntax and the programming model look like? Would this be implemented on top of something like NVSHMEM?
The snippet above answers the programming-model question, I believe. GPU-driven communication in Chapel is just a natural part of the language and of the programming model supported by the global memory view, which roughly means that if something is in lexical scope, you can access it (see the small CPU-side example below).
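As a minimal illustration of that rule with CPU locales today (the variable name is mine):
var x = 10;      // allocated in Locale 0's memory
on Locales[1] {
  writeln(x);    // x is in lexical scope, so this remote read is legal
  x += 1;        // the compiler/runtime generates the communication
}
GPU-driven communication extends the same rule to code running inside kernels.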
In terms of the underlying implementation: We don't use SHMEM for communication. We use GASNet for InfiniBand, libfabric for Slingshot, and ugni for Aries (found on the now-EOL'ed Cray XCs). We need to make GPU-driven communication work with both GASNet and libfabric. The main research-y challenge here is to handle GPU-to-CPU signaling efficiently, and then also to consider how/whether we can aggregate the potentially thousands of GPU-driven communication requests to use the network more efficiently. There's also a question of whether and how GPU-oriented networks like NVLink come into play here.
I'd be excited to elaborate more on any of these topics or chat about any other Chapel-related research ideas you may have!
Engin