Branch: refs/heads/main
Revision: 0f62b7c
Author: e-kayrakli
Log Message:
Merge pull request #18146 from e-kayrakli/proto-gpu
Outline order independent loops for GPU execution
This PR adds the prototype for outlining order-independent loops into GPU
kernels. The outliner fires only with CHPL_LOCALE_MODEL=gpu.
Implementation Details
- This outliner currently works only on special ddata wrappers implemented in a
  test.
- The "pass" runs after dead code elimination (implemented as part of DCE).
- It looks at the bodies of order-independent CForLoops to determine inner and
  outer variables.
  - Outer variables are turned into arguments to the GPU kernel,
  - Inner variables are left alone,
  - The variable that looks like the loop index is replaced by a var fakeIndex
    that's inside the kernel body and is 0. (This is a near-term follow up to
    adjust)
- The outlining itself is pretty straightforward once inner and outer variables
  are determined.
- This PR also simplifies PRIM_GPU_LAUNCH_KERNEL to accept 4 arguments +
  kernel parameters:
  - function name,
  - grid size, (we're currently thinking all the grids will be 1D)
  - block size, (we're currently thinking all the blocks will be 1D)
  - number of kernel parameters.
- To do that, this PR also moves the kernel launch logic into runtime.
- Runtime adjustments:
  - Add chpl_gpu_launch_kernel that the PRIM_GPU_LAUNCH_KERNEL turns into.
  - Add chpl_gpu_check_device_ptr, a basic debugging tool that checks whether
    the pointer we're passing to the kernel is actually a device pointer. This
    function can be removed, or can be tied to an advanced debugging flag at
    some point.
  - Add CUDA_CALL macro for better error reporting.
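To make the outlining and the launch argument layout described above more concrete, here is a minimal CPU-only sketch in C. It is not the code from this PR: outlined_kernel, launch_sketch and MAX_PARAMS are hypothetical names, and the real signature of chpl_gpu_launch_kernel is not shown in this message. The sketch only assumes what the list states: outer variables become kernel parameters, inner variables stay local, the index is a fakeIndex fixed at 0, and the launch takes a function name, grid size, block size and parameter count followed by the parameters. Passing each parameter as a pointer to the argument value follows the convention of the CUDA driver API's kernelParams array.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define MAX_PARAMS 8 /* hypothetical cap, for this sketch only */

/* Hypothetical outlined loop body. The outer variables (a, b, alpha) arrive
 * as kernel parameters; each params[i] points at the argument's value. */
static int outlined_kernel(void **params, int nargs) {
  if (nargs != 3) return -1;
  double *a = *(double **)params[0];
  double *b = *(double **)params[1];
  double alpha = *(double *)params[2];
  int fakeIndex = 0;                 /* stands in for the loop index; always 0 */
  double tmp = alpha * b[fakeIndex]; /* inner variable: stays local */
  a[fakeIndex] = tmp;
  return 0;
}

/* Hypothetical launcher mirroring the four leading arguments (name, grid
 * size, block size, parameter count) followed by the kernel parameters.
 * A real runtime would hand params to the GPU driver; here the "kernel"
 * simply runs once on the CPU. */
static int launch_sketch(const char *name, int grid_size, int block_size,
                         int nargs, ...) {
  void *params[MAX_PARAMS];
  va_list ap;
  assert(name != NULL);
  if (nargs < 0 || nargs > MAX_PARAMS) return -1;
  va_start(ap, nargs);
  for (int i = 0; i < nargs; i++) params[i] = va_arg(ap, void *);
  va_end(ap);
  printf("launch %s: grid=%d block=%d nargs=%d\n",
         name, grid_size, block_size, nargs);
  return outlined_kernel(params, nargs);
}
```

Because fakeIndex is always 0, only element 0 of the output array is written, whatever the grid and block sizes are. That matches the prototype behaviour described above, where index calculation is left as a follow-up.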
Near-term followups
- @stonea will work on adding the index calculation logic
- @daviditen is working on adding a loop body analysis
- pass the loop length into the kernel for bounds checking
- work on the GPU locale model to allocate data on the GPU when inside an
  on gpu-like block.
- work on using Chapel arrays inside the loop body
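The index calculation and bounds checking follow-ups listed above are not part of this PR. As a rough illustration of what they usually look like for 1D grids and blocks, the following C sketch simulates a launch on the CPU. All names here (grid_size_for, kernel_body_sketch, simulate_launch) are hypothetical, and the scheme shown is the common CUDA idiom of block index times block size plus thread index, not necessarily what the follow-up work will implement.

```c
#include <assert.h>

/* Number of 1D blocks needed to cover n iterations: ceil(n / block_size). */
static int grid_size_for(int n, int block_size) {
  assert(n >= 0 && block_size > 0);
  return (n + block_size - 1) / block_size;
}

/* Hypothetical kernel body. On a GPU the three index values would come from
 * built-ins; they are explicit arguments here because this runs on the CPU. */
static void kernel_body_sketch(int block_idx, int block_dim, int thread_idx,
                               int n, double *a, const double *b) {
  int i = block_idx * block_dim + thread_idx; /* global 1D index */
  if (i < n) {                                /* bounds check on loop length */
    a[i] = 2.0 * b[i];
  }
}

/* Runs every (block, thread) pair sequentially to mimic a 1D launch. */
static void simulate_launch(int n, int block_size, double *a, const double *b) {
  int grid_size = grid_size_for(n, block_size);
  for (int blk = 0; blk < grid_size; blk++) {
    for (int thr = 0; thr < block_size; thr++) {
      kernel_body_sketch(blk, block_size, thr, n, a, b);
    }
  }
}
```

The bounds check matters whenever the loop length is not a multiple of the block size: with 10 iterations and a block size of 4, three blocks are launched and the last two threads of the final block must do nothing.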
The effort for getting a STREAM-like benchmark running is tracked in:
https://github.com/Cray/chapel-private/issues/2296
[Reviewed by @stonea]
Test
- standard with CHPL_LOCALE_MODEL=flat
- some subset with CHPL_LOCALE_MODEL=gpu
Modified Files:
A test/gpu/native/streamPrototype/gpuOutline.chpl
A test/gpu/native/streamPrototype/gpuOutline.compopts
A test/gpu/native/streamPrototype/gpuOutline.execenv
A test/gpu/native/streamPrototype/gpuOutline.good
A test/gpu/native/streamPrototype/gpuOutline.prediff
M compiler/AST/primitive.cpp
M compiler/codegen/cg-expr.cpp
M compiler/include/primitive_list.h
M compiler/optimizations/deadCodeElimination.cpp
M compiler/optimizations/optimizeOnClauses.cpp
M runtime/include/chpl-gpu.h
M runtime/src/chpl-gpu.c
M test/gpu/native/gpuAddNums/gpuAddNums_primitive.chpl
M test/gpu/native/threadBlockAndGridPrimitives.chpl
Compare: Comparing 32dd29e6af43...0f62b7c40b69 · chapel-lang/chapel · GitHub