Chapel GPU PGAS Support

In the Chapel "Introduction to GPU Programming" blog post, it says:

To make an array accessible from code that runs on the GPU (such as our foreach loop), the array must live on the GPU locale. In the future, Chapel aims to make it possible for GPU code to access arrays declared outside of the GPU locale. However, this would require some communication between the GPU and CPU, initiated by the GPU. GPU-driven communication like that is on our roadmap, but not supported at the time of writing.

Does this mean that Chapel developers are working towards features that allow us to declare a distributed array over multiple GPU locales, the same way we can declare a distributed array over CPU locales? If so, what would the syntax and programming model look like? Would this be implemented on top of something like NVSHMEM?

I am a graduate student at UMD College Park. I am building on some previous work on optimizing the Chapel compiler for irregular memory accesses. I am trying to find out if there is a good research question on PGAS GPU programming in Chapel. Thanks!!


Hello @xl-tian! Welcome to our Discourse!

I'll try to answer everything, but in a slightly different order than you asked:

Does this mean that Chapel developers are working towards features that allow us to declare a distributed array over multiple GPU locales, the same way we can declare a distributed array over CPU locales?

That's exactly the end goal here. Something like the following is an example of how that could look:

use BlockDist;

// an array distributed across all local GPUs
var Arr = blockDist.createArray(1..n, int, targetLocales=here.gpus);

// this will run as kernels on each GPU that Arr is distributed on
forall elem in Arr do
  elem = compute();

But I also want to make a distinction between two separate tasks:

  • distributed array support as I outlined above
  • GPU-driven communication as you cited above

While they look related, and I am sure there will be use cases that need both, they are separate implementation-wise. The second bullet refers to:

// A remote array that sits on a different compute node
on Locales[1] var RemoteCpuArr: [1..n] int; 

on here.gpus[0] {
  forall i in 1..n { // this will be a kernel
    RemoteCpuArr[i] = i;
  }
}

The access to RemoteCpuArr inside the forall loop is perfectly legal in Chapel. But right now it doesn't work because of the lack of GPU-driven communication. The example is a bit contrived, admittedly. Probably a more common case is GPUs sitting across the network communicating with each other from inside kernels.
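For concreteness, here is a minimal sketch of that more common case, reusing the on-declaration style above. It would not work today for the same reason; each read of B inside the kernel would require GPU-driven communication:

config const n = 10;

// two arrays living on GPUs of two different nodes
on Locales[0].gpus[0] var A: [1..n] int;
on Locales[1].gpus[0] var B: [1..n] int;

on Locales[0].gpus[0] {
  forall i in 1..n do  // runs as a kernel on node 0's GPU
    A[i] = B[i];       // each read of B would be a GPU-initiated remote get
}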

I also want to take a quick sidebar on "from inside kernels". Data allocated on a GPU can already be communicated across the network today:

on here.gpus[0] var GpuArr: [1..n] int;

on Locales[1].gpus[5] {
  var MyArr = GpuArr;  // data moving from one GPU to another across the network
}

The snippet above should work today; the difference is that the communication is initiated by the CPU, not the GPU. In other words, what's important is which processor initiates the communication, not where the data is. That being said, the current implementation of this kind of communication is a bit inefficient, as the data is moved through host memory.
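To make that last point a bit more concrete, here is a hand-written decomposition of roughly what that host-staged path amounts to. This is only an illustration of the data movement, not the runtime's actual code, and it assumes node 1 has at least 6 GPUs as in the snippet above:

config const n = 10;

on here.gpus[0] var GpuArr: [1..n] int;

// stage explicitly through host memory on both ends
var SrcHost = GpuArr;            // device-to-host copy on this node

on Locales[1] {
  var DstHost = SrcHost;         // host-to-host transfer across the network
  on here.gpus[5] {
    var MyArr = DstHost;         // host-to-device copy on the remote node
  }
}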

Note that several things I covered here, and especially the last point, are also covered in a recent demo that was recorded. You might want to check that out since you are interested in the internals of Chapel's GPU support: https://www.youtube.com/watch?v=J0av4VJbS4o&list=PLuqM5RJ2KYFhNSlQFpOe9Sz8sftdsuAxO&index=3&ab_channel=ChapelParallelProgrammingLanguage

I am trying to find out if there is a good research question on PGAS GPU programming in Chapel. Thanks!!

I believe there are plenty. Let me go over some of the things I touched upon here:

  1. Distributed array support: I really want to see this in Chapel, though the work is more engineering than research at this point. Once that engineering effort is done, we can regroup and consider potential research directions.
  2. GPU-driven communication: This is probably the most exciting research direction. There are many questions that need to be answered for an efficient implementation, while a relatively rudimentary implementation to establish the correctness of the feature doesn't require a ton of engineering.
  3. Inefficient GPU data movement: This is also mostly an engineering effort, but it could be interesting and publishable work if you want to pursue it.

If so, what would the syntax and programming model look like? Would this be implemented on top of something like NVSHMEM?

The snippet above answers the programming model question, I believe. GPU-driven communication in Chapel is just a natural part of the language and the programming model supported by the global memory view, which roughly means that if something is in lexical scope, you can access it.
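If you haven't run into that global memory view before, here is a tiny CPU-only example of the same rule, where the communication is generated implicitly by the compiler and runtime:

// x lives in Locales[0]'s memory, but it is in lexical scope inside the on block
var x = 42;

on Locales[1] {
  writeln(x);   // an implicit remote get of x
  x = 43;       // an implicit remote put
}

GPU-driven communication would extend exactly this behavior to accesses made from inside kernels.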

In terms of the underlying implementation: We don't use SHMEM for communication. We use GASNet for InfiniBand, libfabric for Slingshot, and ugni for Aries (the now-EOL'ed Cray XCs have Aries). We need to make GPU-driven communication work with both GASNet and libfabric. The main research-y challenge here is to handle GPU-to-CPU signaling efficiently, and then also to consider how/whether we can aggregate potentially thousands of GPU-driven communication requests to use the network more efficiently. There's also a question of whether and how GPU-oriented networks like NVLink come into play here.

I'd be excited to elaborate more on any of these topics or chat about any other Chapel-related research ideas you may have!

Engin


Hi Engin. Thanks for the detailed response!! I will definitely start by looking at the code you mentioned in the video. I am very interested in how distributed data structures and GPU-driven communication work in Chapel. If you could provide more pointers to resources/code, that would help a lot!

For example, what will happen in this code when Arr[i] = exp(Arr[j] + Arr[k]) is executed?

use Math;
use BlockDist;

// Assume 2 nodes, each node has 4 GPUs
var gpus : [1..8] locale;
gpus[1..4] = Locales[0].gpus;
gpus[5..8] = Locales[1].gpus;

// Arr is distributed over the 8 GPUs on two nodes
var Arr = blockDist.createArray(1..n, real, targetLocales=gpus);

// I assume that this forall loop will be a kernel
forall i in Arr.domain {
  var j = 0;
  var k = 0;

  // Compute j, k, which depend on Arr[i], and will take a long time
  j = computej(Arr[i]);
  k = computek(Arr[i]);

  Arr[i] = exp(Arr[j] + Arr[k]);

  // Do other expensive computations which change Arr[i]
  Arr[i] = update(Arr[i]);
}

Additionally, what will happen in this code when LocalGpuArr[i] = RemoteGpuArr[j] is executed?

// A remote array that sits on a different compute node's GPU
on Locales[1].gpus[0] var RemoteGpuArr: [1..n] int; 

on here.gpus[0] {
  var LocalGpuArr: [1..n] int;

  forall i in 1..n { // this will be a kernel
    // Compute index j and value threshold, which depend on LocalGpuArr[i] and will take a long time
    var j = computej(LocalGpuArr[i]);
    var threshold = computet(LocalGpuArr[i]);

    // If threshold < some known number, access RemoteGpuArr[j]
    if (threshold < 25) {
      LocalGpuArr[i] = RemoteGpuArr[j];
    }

    // Do other expensive computations which change LocalGpuArr[i]
    LocalGpuArr[i] = update(LocalGpuArr[i]);
  }
}

What I am a bit confused about is: without NVSHMEM, how could we do remote communication to access nonlocal data inside (for example) a CUDA kernel? Thanks!!

I think both of those snippets are doing the same thing in different ways. Arr[j], Arr[k], and RemoteGpuArr[j] are all problematic for the same reason: they could all be remote, and when that happens you need GPU-driven communication.
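One way to see the problem on the CPU side today: under blockDist, an arbitrary index j may map to a different locale than i does. The sketch below uses a CPU-only stand-in for your Arr and a made-up computej, and simply reports which accesses would be remote (and would therefore need GPU-driven communication if issued from a kernel):

use BlockDist;

config const n = 100;

// a CPU-only stand-in for your GPU-distributed Arr
var Arr = blockDist.createArray(1..n, real);

// a stand-in for your index computation
proc computej(x: real): int {
  return ((x: int) % n) + 1;
}

for i in Arr.domain {
  const j = computej(Arr[i]);
  if Arr[j].locale != Arr[i].locale then
    writeln("Arr[", i, "] and Arr[", j, "] live on different locales");
}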

What I am a bit confused about is: without NVSHMEM, how could we do remote communication to access nonlocal data inside (for example) a CUDA kernel? Thanks!!

Before digging deeper: the questions that you are asking, and that we are trying to answer here, are fundamentally at the same level as where NVSHMEM sits. In other words, the right question to ask is how we can do something similar to what NVSHMEM does. We don't use SHMEM to get SHMEM-like behavior; similarly, we don't need NVSHMEM to get NVSHMEM-like behavior. We will have to implement something on top of lower-level APIs that does what we want.


Let's go back to your examples. I recommend writing the first snippet without targetLocales, or even without GPU support enabled, to see what kind of code we generate.

Here's a version of that snippet that works without GPUs:

use Math;
use BlockDist;

config const n = 10;

// Arr is block-distributed across the CPU locales (no GPUs in this version)
var Arr = blockDist.createArray(1..n, real);

// with GPUs, this forall would be launched as kernels; here it runs on the CPUs
forall i in Arr.domain {
  var j = 1;  // some index that may be owned by a different locale
  Arr[i] = Arr[j];
}

writeln(Arr);

Here's how I compile:

# disable the remote cache in the runtime for simplicity
# save the generated code
chpl foo.chpl --no-cache-remote --savec gen_code

Now, take a look at the generated LLVM:

cd gen_code
llvm-dis chpl__module-nopt.bc
vim chpl__module-nopt.ll

In that code, you will see calls to chpl_gen_comm_get. One of them must be handling Arr[j]. It doesn't really matter which one. That's how Chapel implements the one-sided "get" operation.

  1. Here's the implementation of chpl_gen_comm_get: runtime/include/chpl-comm-compiler-macros.h on the main branch of chapel-lang/chapel on GitHub. You can track down each branch, but I recommend focusing on --no-cache-remote compilation, which would end up in the chpl_comm_get branch if the data happens to be remote.

  2. Assuming you are running with GASNet (anything multinode that's not Cray), that will lead you to runtime/src/comm/gasnet/comm-gasnet-ex.c on the main branch of chapel-lang/chapel. The details of this implementation are not that important, but this is where we end up using GASNet to move the data around, which is arguably lower level than SHMEM, for example.

When the GPUs are at play, we want to have a __device__ version of chpl_gen_comm_get, which is in fact something we have: runtime/include/gpu/chpl-gpu-gen-common.h (at commit 0fbac9583fc77ad16297c54162c9264f75ae1e4b in chapel-lang/chapel). But it is just a stub that generates a warning. Currently, our compiler tries hard not to generate chpl_gen_comm_get, because we put an implicit local block inside the generated kernels here: compiler/optimizations/gpuTransforms.cpp (same commit).
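If you haven't seen Chapel's local statement before, here is a small CPU-only illustration of what that implicit block asserts. This is just a sketch of the semantics; the compiler inserts its own variant around kernel bodies:

var x = 0;

on Locales[1] {
  local {
    x = 1;  // with checks enabled, this halts at runtime: x lives on Locales[0]
  }
}

A local block asserts that no communication happens inside it, which is why the compiler can avoid generating chpl_gen_comm_get calls for kernel code today.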

Now, imagine removing that line from the compiler. In that case, the code you have will probably generate chpl_gen_comm_get calls within kernels, which will end up calling that stub and generating warnings. At that point, the question becomes how to implement a proper __device__ chpl_gen_comm_get that does actual communication.

Here's a very crude outline of how that might work:

  1. we have a per-GPU signaling flag that the device sets when a get is called from the device.
  2. the host thread that launched the kernel polls that flag
    a. this signaling has to be done atomically, most likely via something like the atomics described in the CUDA C++ Programming Guide
  3. when the host sees that a request is available, it starts the communication
    a. for this to happen, the communication layer (GASNet, for this example) has to be able to move data that's in GPU memory without having to copy it into host memory. The GASNet feature we need for that is called "memory kinds": https://gasnet.lbl.gov/docs/memory_kinds.pdf

At this point, we should be able to get GPU-driven gets to work, and I imagine puts would have a relatively symmetric implementation.

Note that what I have above should work, but it will perform really badly. At that point, there are quite a lot of interesting questions and challenges around optimizing things such that when thousands of GPU threads issue 8-byte gets at the same time, we do something smart to unblock those threads as quickly as possible. I do have some ideas for that, but they are very rough and we should probably save that discussion for another day. We could also learn from NVSHMEM's publicly available talks etc.
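For reference, the kind of access pattern that would stress that path looks like the snippet below: an irregular gather where every kernel iteration issues one small remote get. This is a sketch under the assumption that GPU-driven communication exists; it does not work today:

config const n = 1_000_000;

// data sitting in another node's memory
on Locales[1] var Remote: [1..n] int;

on here.gpus[0] {
  // an irregular index pattern, as in gather/scatter-style workloads
  var Idx: [1..n] int = [i in 1..n] ((i * 7919) % n) + 1;
  var A: [1..n] int;

  forall i in 1..n do      // a kernel where every iteration is a tiny remote get
    A[i] = Remote[Idx[i]];
}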

I waved my hands a lot, but hopefully this helps create a framework for your thought process. Let me know if you have any further questions.

Engin