Chapel array and record representation on CUDA device

How are Chapel arrays and record instances represented when they are allocated on the CUDA device? Are they flattened arrays or do they have the same representation as data in normal Chapel programs? If I have written a CUDA kernel in C, how would I call it from my Chapel program?

Hi Iain,

Records can be tricky. Our interop documentation states that non-extern Chapel records can't be passed to extern procs. So the only way to make them work would be to have a C struct, which you then declare in Chapel as an extern record. The implication is that you'd have to allocate memory on the C side for such records, presumably with something like cudaMalloc. I can try to put together an example of this, if you're interested.
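As a rough sketch of that pattern (the struct name `Point`, the header `point.h`, and the function `fillPoint` are all hypothetical, not from any real library):

```chapel
// point.h (C side, hypothetical):
//   typedef struct { double x, y; } Point;
//   void fillPoint(Point *p);

require "point.h";
use CTypes;

// Declare the C struct as an extern record in Chapel
extern record Point {
  var x: real;
  var y: real;
}

extern proc fillPoint(p: c_ptr(Point));

var pt: Point;
fillPoint(c_ptrTo(pt));  // the C side populates the record in place
writeln(pt.x, " ", pt.y);
```

For device-resident records, the allocation itself (e.g. via cudaMalloc) would happen on the C side, and Chapel would only see a `c_ptr(Point)`.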

You can pass pointers to Chapel arrays to C functions, which may call CUDA kernels down the road. Note that CUDA kernel launches have an esoteric syntax that's not supported by Chapel. So, there needs to be a C wrapper in between that's called from Chapel and that calls CUDA. For such cases, you can pass Chapel arrays allocated on the GPU memory like:

on here.gpus[0] {
  var Arr: [1..10] int; // this is allocated on the device memory
  externProc(c_ptrTo(Arr));  // this passes the address of the array's data buffer
}
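For completeness, the C/CUDA side of that call could look roughly like this. The wrapper name `externProc` matches the snippet above; the kernel `scaleKernel` and the launch configuration are illustrative assumptions only:

```c
// wrapper.cu (hypothetical), compiled with nvcc and linked into the
// Chapel program.
#include <stdint.h>
#include <cuda_runtime.h>

__global__ void scaleKernel(int64_t *data, int64_t n) {
  int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2;
}

extern "C" void externProc(int64_t *devPtr) {
  // devPtr already points to device memory allocated by Chapel,
  // so the kernel can be launched on it directly.
  scaleKernel<<<1, 32>>>(devPtr, 10);
  cudaDeviceSynchronize();
}
```

The `extern "C"` linkage is what lets Chapel's extern proc resolve the symbol despite the file being compiled as C++/CUDA.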

I created the issue "Can we provide a way to call a CUDA/HIP kernel directly from Chapel?" (chapel-lang/chapel#25302) to ask for a more direct way of launching a CUDA kernel from Chapel.

Hope this helps,
Engin

Hi Engin,

Thank you very much for your response. That's too bad about records, but this is definitely helpful. I would love to see the example you mentioned if that's not too much of a hassle.

As for cuBLAS and friends, would making calls to these necessitate writing extern procs in C/CUDA that call these libraries, then use them in Chapel?

What representation do arrays of higher dimension have on the C side?

When on here.gpus[0], does c_ptrTo(Arr) do any copying of Arr? In other words, is there any overhead to manipulating Chapel arrays via C functions if they are on a GPU? Does the array need to be copied back to host memory, passed into the C function, and then copied to the CUDA device, or does everything remain on the device?

Thanks,
Iain

Hi Iain,

I think performance may be a bit tricky, so I created the issue "Low CUDA API call performance when called from Chapel through interoperability" (chapel-lang/chapel#25311). The example in that issue should answer your question here. Let me know if that's not the case.

> As for cuBLAS and friends, would making calls to these necessitate writing extern procs in C/CUDA that call these libraries, then use them in Chapel?

Probably not. AFAIK, cuBLAS is entirely a host-side library, so you should be able to invoke cuBLAS from Chapel directly. Note that we have a draft library that does exactly that, but it is only tested with CHPL_LOCALE_MODEL=flat. It has been a long-standing wish to ramp that library up so that it can be used in production with GPU support enabled (CHPL_LOCALE_MODEL=gpu). See the library here: chapel/test/gpu/interop/cuBLAS at main · chapel-lang/chapel · GitHub
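As a sketch of what a direct call might look like (assuming -lcublas at link time; the declarations follow the cuBLAS v2 API as documented by NVIDIA, but treat the Chapel-side extern declarations as unverified assumptions, not a tested recipe):

```chapel
use CTypes;

// Hand-written extern declarations for a small slice of cuBLAS.
extern type cublasHandle_t;
extern proc cublasCreate_v2(ref handle: cublasHandle_t): c_int;
extern proc cublasDscal_v2(handle: cublasHandle_t, n: c_int,
                           const ref alpha: real(64),
                           x: c_ptr(real(64)), incx: c_int): c_int;

on here.gpus[0] {
  var X: [0..#10] real = 1.0;  // device-resident buffer
  var handle: cublasHandle_t;
  cublasCreate_v2(handle);
  const alpha = 2.0;
  // Scales the device buffer in place: X *= alpha
  cublasDscal_v2(handle, 10, alpha, c_ptrTo(X), 1);
}
```

Since cuBLAS runs on the host but consumes device pointers, the `c_ptrTo(X)` obtained inside the `on here.gpus[0]` block is exactly what it expects.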

(On a quick look at that library, it looks like we have a C wrapper for it. I bet that has to do with C vs. C++ linkage, but I can't really remember.)

> What representation do arrays of higher dimension have on the C side?

Local, rectangular Chapel arrays have contiguous memory allocation, so their buffer can be used directly in C. By default, such arrays use row-major ordering.
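As a concrete illustration of that layout, here's a small helper computing the flat offset of a 2D Chapel array element as seen from C (the helper itself is hypothetical, not part of any Chapel-provided API):

```c
#include <stddef.h>

/* A local array declared as `var A: [1..3, 1..4] real;` arrives in C as
   one contiguous row-major buffer of 12 doubles.  With Chapel's 1-based
   indices, element A[i, j] sits at this flat offset: */
size_t chapel2dOffset(size_t i, size_t j, size_t ncols) {
  return (i - 1) * ncols + (j - 1);
}
```

So for the `[1..3, 1..4]` example, A[2, 3] maps to offset 6 in the buffer, and the last element A[3, 4] maps to offset 11.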

> When on here.gpus[0], does c_ptrTo(Arr) do any copying of Arr?

No. A pointer is a pointer. If the array was allocated inside on here.gpus[0], the pointer you get will point to GPU memory. You can store that pointer in a void* in C, or in a CUdeviceptr. As long as you keep in mind that it points to GPU memory and handle it accordingly, you should be fine.
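On the C side, such a pointer can be handed straight to CUDA APIs without any staging copies. A hypothetical wrapper showing both the runtime-API and driver-API views of the same address:

```c
// zero_wrapper.c (hypothetical): the pointer Chapel passes in already
// refers to device memory, so CUDA APIs can consume it directly.
#include <stdint.h>
#include <cuda_runtime.h>
#include <cuda.h>

void zeroDeviceArray(void *ptr, int64_t n) {
  // Runtime API: treat it as a plain void* device pointer.
  cudaMemset(ptr, 0, n * sizeof(int64_t));

  // Driver API: the same address can be carried as a CUdeviceptr.
  CUdeviceptr d = (CUdeviceptr)(uintptr_t)ptr;
  cuMemsetD8(d, 0, n * sizeof(int64_t));
}
```

No host/device round trip happens at any point; the buffer stays on the device the whole time.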

Engin
