[Chapel Merge] Outline order independent loops for GPU execution

Branch: refs/heads/main
Revision: 0f62b7c
Author: e-kayrakli
Log Message:

Merge pull request #18146 from e-kayrakli/proto-gpu

Outline order independent loops for GPU execution

This PR adds the prototype for outlining order-independent loops into GPU
kernels. The outliner fires only with CHPL_LOCALE_MODEL=gpu.

Implementation Details

  • The outliner currently works only on special ddata wrappers implemented in a
    test.
  • The "pass" runs after dead code elimination (it is implemented as part of
    DCE).
  • It looks at the bodies of order-independent CForLoops to determine inner and
    outer variables.
    • Outer variables are turned into arguments to the GPU kernel,
    • Inner variables are left alone,
    • The variable that looks like the loop index is replaced by a var fakeIndex
      that is declared inside the kernel body and set to 0. (Adjusting this is a
      near-term follow-up.)
  • The outlining itself is pretty straightforward once inner and outer variables
    are determined.
  • This PR also simplifies PRIM_GPU_LAUNCH_KERNEL to accept 4 arguments +
    kernel parameters:
    • function name,
    • grid size (we're currently thinking all grids will be 1D),
    • block size (we're currently thinking all blocks will be 1D),
    • number of kernel parameters.
  • To do that, this PR also moves the kernel launch logic into the runtime.
  • Runtime adjustments:
    • Add chpl_gpu_launch_kernel, which PRIM_GPU_LAUNCH_KERNEL is lowered to.
    • Add chpl_gpu_check_device_ptr, a basic debugging tool that checks whether
      the pointer we're passing to the kernel is actually a device pointer. This
      function can be removed or tied to an advanced debugging flag at some
      point.
    • Add CUDA_CALL macro for better error reporting.

Near-term followups

  • @stonea will work on adding the index calculation logic
  • @daviditen is working on adding a loop body analysis
  • pass the loop length into the kernel for bounds checking
  • work on the GPU locale model to allocate data on the GPU when inside a
    gpu-like on block.
  • work on using Chapel arrays inside the loop body

The effort for getting a STREAM-like benchmark running is tracked in:
https://github.com/Cray/chapel-private/issues/2296

[Reviewed by @stonea]

Test

  • standard with CHPL_LOCALE_MODEL=flat

  • some subset with CHPL_LOCALE_MODEL=gpu

    Modified Files:
    A test/gpu/native/streamPrototype/gpuOutline.chpl
    A test/gpu/native/streamPrototype/gpuOutline.compopts
    A test/gpu/native/streamPrototype/gpuOutline.execenv
    A test/gpu/native/streamPrototype/gpuOutline.good
    A test/gpu/native/streamPrototype/gpuOutline.prediff
    M compiler/AST/primitive.cpp
    M compiler/codegen/cg-expr.cpp
    M compiler/include/primitive_list.h
    M compiler/optimizations/deadCodeElimination.cpp
    M compiler/optimizations/optimizeOnClauses.cpp
    M runtime/include/chpl-gpu.h
    M runtime/src/chpl-gpu.c
    M test/gpu/native/gpuAddNums/gpuAddNums_primitive.chpl
    M test/gpu/native/threadBlockAndGridPrimitives.chpl

    Compare: Comparing 32dd29e6af43...0f62b7c40b69 · chapel-lang/chapel · GitHub