[Chapel Merge] Implement allocation support for GPU sublocales

Branch: refs/heads/main
Revision: d89e062
Author: e-kayrakli
Log Message:

Merge pull request #18199 from e-kayrakli/gpu-mem

Implement allocation support for GPU sublocales

This PR mainly implements special allocators for the GPU sublocale.

on here.getChild(1) { // assume this is a gpu sublocale
  var a: [1..3] int;
  // ...
}

We want the array in the snippet above to be allocated on the GPU. To do that,
we need to make sure that both (1) the array instance and (2) its ddata are
allocated in memory that is accessible by the GPU.

Background

Most of the heap allocations are managed by 5 functions implemented in
the LocaleModelHelpMem module. These are:

  • chpl_here_alloc
  • chpl_here_calloc
  • chpl_here_aligned_alloc
  • chpl_here_realloc
  • chpl_here_free

These functions typically call into runtime functions that are named
chpl_mem_*.

However, the data allocations for the array are handled by separate but similar
functions implemented in runtime/include/chpl-mem-array.h. That's done to
register the array data with the comm layer for faster communication.

This PR:

  • Implements chpl_gpu_mem_* functions that correspond to those 5 above.
  • Implements chpl_here_* functions in the GPU locale model that call into the
    appropriate runtime function depending on, among other things, where we are
    executing.
  • Adjusts the allocator and the deallocator in chpl-mem-array that are used by
    ddata allocations to use chpl_gpu_* as appropriate.
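
As a rough illustration of the dispatch idea in the list above, here is a
minimal C sketch. It is not the code from this PR: every sketch_* name is
hypothetical, the "GPU" allocator is a plain calloc stand-in so the example
builds without a GPU toolchain, and the rule that sublocale 0 is the CPU while
positive ids are GPUs is an assumption made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Every name below is hypothetical and exists only for this sketch;
   none of it is taken from the Chapel runtime. */

/* Counters so a caller can observe which path an allocation took. */
static int sketch_gpu_allocs = 0;
static int sketch_host_allocs = 0;

/* Stand-in for a unified-memory allocator. A real one would call the
   GPU vendor's managed-memory API; calloc keeps the sketch portable. */
static void *sketch_gpu_mem_alloc(size_t size) {
  sketch_gpu_allocs++;
  return calloc(1, size);
}

/* Stand-in for the ordinary host allocator. */
static void *sketch_host_mem_alloc(size_t size) {
  sketch_host_allocs++;
  return malloc(size);
}

/* Assumption: sublocale 0 is the CPU and any positive id is a GPU. */
static bool sketch_on_gpu_sublocale(int sublocale_id) {
  return sublocale_id > 0;
}

/* The dispatch step: one entry point, two backing allocators. */
static void *sketch_here_alloc(int sublocale_id, size_t size) {
  assert(size > 0);
  if (sketch_on_gpu_sublocale(sublocale_id)) {
    return sketch_gpu_mem_alloc(size);
  }
  return sketch_host_mem_alloc(size);
}

/* Both stand-ins use the C heap, so a single free suffices here; a real
   implementation would have to route the free the same way as the alloc. */
static void sketch_here_free(void *ptr) {
  free(ptr);
}
```

Calling sketch_here_alloc(1, n) takes the GPU path and sketch_here_alloc(0, n)
takes the host path. The point of the pattern is that callers keep using one
entry point while the choice of allocator stays in one place.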

The end result is that the snippet above allocates the array's instance and its
ddata in unified memory that is accessible from both the GPU and the CPU.

We eventually want to refactor this implementation. The idea is captured here:

https://github.com/Cray/chapel-private/issues/2406

Unified vs Device Memory

I initially started implementing this with device memory in mind. However,
there are some thorny areas there. If we use device memory, the array instance
will be allocated in GPU memory, so we'd need to initialize the instance on
the GPU. We aren't ready for that yet. There's also the question of whether we
should actually do that sequential operation on the GPU or instead move the
instance from the CPU to the GPU after initialization, for performance reasons.
More on this:

https://github.com/Cray/chapel-private/issues/2296#issuecomment-894543245

More implementation details:

  • Drops the "number of arguments" actual from the kernel launch primitives.
  • Adjusts tests that invoke the kernel launch primitive.
  • Adds a more proper chpl_gpu_is_device_ptr helper in the runtime.
  • Makes chpl_gpu_mem_memalign report a "not implemented" error.
  • Adds two new tests.

[Reviewed by @mppf and @ronawho, with input from @gbtitus]

Test

  • [x] test/gpu/native passes with the nightly GPU testing config

  • [x] standard

Modified Files:
A test/gpu/native/memory/COMPOPTS
A test/gpu/native/memory/EXECENV
A test/gpu/native/memory/basic.chpl
A test/gpu/native/memory/basic.good
A test/gpu/native/memory/dr.chpl
A test/gpu/native/memory/dr.good
M compiler/codegen/cg-expr.cpp
M compiler/optimizations/deadCodeElimination.cpp
M modules/internal/ChapelBase.chpl
M modules/internal/localeModels/gpu/LocaleModel.chpl
M runtime/include/chpl-gpu.h
M runtime/include/chpl-mem-array.h
M runtime/src/chpl-gpu.c
M test/gpu/native/gpuAddNums/gpuAddNums_primitive.chpl
M test/gpu/native/streamPrototype/gpuOutline.chpl
M test/gpu/native/threadBlockAndGridPrimitives.chpl

Compare: b521dc98825e...d89e062d78ad (chapel-lang/chapel on GitHub)