Branch: refs/heads/main
Revision: d89e062
Author: e-kayrakli
Log Message:
Merge pull request #18199 from e-kayrakli/gpu-mem
Implement allocation support for GPU sublocales
This PR mainly implements special allocators for the GPU sublocale.
on here.getChild(1) { // assume this is a gpu sublocale
var a: [1..3] int;
// ...
}
We want the array in the snippet above to be allocated on the GPU. To do
that, we need to make sure that both (1) the array instance and (2) its ddata are
allocated in memory accessible by the GPU.
Background
Most heap allocations are managed by five functions implemented in
the LocaleModelHelpMem module. These are:
chpl_here_alloc
chpl_here_calloc
chpl_here_aligned_alloc
chpl_here_realloc
chpl_here_free
These functions typically call into runtime functions that are named chpl_mem_*.
However, the data allocations for the array are handled by separate but similar
functions implemented in runtime/include/chpl-mem-array.h. That's done to
register the array data with the comm layer for faster communication.
This PR:
- Implements chpl_gpu_mem_* functions that correspond to the five functions above.
- Implements chpl_here_* functions in the GPU locale model that call into the
appropriate runtime function depending on where we are executing.
- Adjusts the allocator and the deallocator in chpl-mem-array that are used by
ddata allocations to use chpl_gpu_* as appropriate.
The end result is that the snippet above allocates the array's instance and
its ddata on unified memory that's accessible from both the GPU and the CPU.
We eventually want to refactor this implementation. The idea is captured here:
https://github.com/Cray/chapel-private/issues/2406
Unified vs Device Memory
I initially started implementing this with device memory in mind. However,
there are some thorny areas there. If we use device memory, the array instance
will be allocated in GPU memory, and therefore we'd need to initialize the
instance on the GPU. We aren't ready for that yet. There's also the question of
whether we should actually run that sequential operation on the GPU, or instead
move the instance from the CPU to the GPU after initialization for performance
reasons. More on this:
https://github.com/Cray/chapel-private/issues/2296#issuecomment-894543245
More implementation details:
- Drops the number-of-arguments actual from the kernel launch primitives.
- Adjusts tests that invoke the kernel launch primitive.
- Adds a more proper chpl_gpu_is_device_ptr helper in the runtime.
- Lets chpl_gpu_mem_memalign give a "not implemented" error.
- Adds two new tests.
[Reviewed by @mppf and @ronawho, with input from @gbtitus]
Test
- [x] test/gpu/native passes with the nightly GPU testing config
- [x] standard
Modified Files:
A test/gpu/native/memory/COMPOPTS
A test/gpu/native/memory/EXECENV
A test/gpu/native/memory/basic.chpl
A test/gpu/native/memory/basic.good
A test/gpu/native/memory/dr.chpl
A test/gpu/native/memory/dr.good
M compiler/codegen/cg-expr.cpp
M compiler/optimizations/deadCodeElimination.cpp
M modules/internal/ChapelBase.chpl
M modules/internal/localeModels/gpu/LocaleModel.chpl
M runtime/include/chpl-gpu.h
M runtime/include/chpl-mem-array.h
M runtime/src/chpl-gpu.c
M test/gpu/native/gpuAddNums/gpuAddNums_primitive.chpl
M test/gpu/native/streamPrototype/gpuOutline.chpl
M test/gpu/native/threadBlockAndGridPrimitives.chpl