[Chapel Merge] Outline order independent loops for GPU execution

Branch: refs/heads/main
Revision: 0f62b7c
Author: e-kayrakli
Log Message:

Merge pull request #18146 from e-kayrakli/proto-gpu

Outline order independent loops for GPU execution

This PR adds the prototype for outlining order-independent loops into GPU
kernels. The outliner fires only with CHPL_LOCALE_MODEL=gpu.

Implementation Details

  • The outliner currently works only on special ddata wrappers implemented in a
    test.
  • The "pass" runs after dead code elimination (it is implemented as part of
    DCE).
  • It looks at the bodies of order-independent CForLoops to determine inner and
    outer variables.
    • Outer variables are turned into arguments to the GPU kernel,
    • Inner variables are left alone,
    • The variable that looks like the loop index is replaced by a var fakeIndex
      that is declared inside the kernel body and set to 0. (Adjusting this is a
      near-term follow-up.)
  • The outlining itself is pretty straightforward once inner and outer variables
    are determined.
  • This PR also simplifies PRIM_GPU_LAUNCH_KERNEL to accept 4 arguments +
    kernel parameters:
    • function name,
    • grid size (we're currently thinking all grids will be 1D),
    • block size (we're currently thinking all blocks will be 1D),
    • number of kernel parameters.
  • To do that, this PR also moves the kernel launch logic into the runtime.
  • Runtime adjustments:
    • Add chpl_gpu_launch_kernel, which PRIM_GPU_LAUNCH_KERNEL is lowered to.
    • Add chpl_gpu_check_device_ptr, a basic debugging tool that checks whether
      the pointer we're passing to the kernel is actually a device pointer. This
      function can be removed or tied to an advanced debugging flag at some
      point.
    • Add CUDA_CALL macro for better error reporting.

Near-term followups

  • @stonea will work on adding the index calculation logic
  • @daviditen is working on adding a loop body analysis
  • pass the loop length into the kernel for bounds checking
  • work on the GPU locale model to allocate data on the GPU when inside a
    gpu-like on block.
  • work on using Chapel arrays inside the loop body

The effort for getting a STREAM-like benchmark running is tracked in:
https://github.com/Cray/chapel-private/issues/2296

[Reviewed by @stonea]

Test

  • standard with CHPL_LOCALE_MODEL=flat

  • some subset with CHPL_LOCALE_MODEL=gpu

    Modified Files:
    A test/gpu/native/streamPrototype/gpuOutline.chpl
    A test/gpu/native/streamPrototype/gpuOutline.compopts
    A test/gpu/native/streamPrototype/gpuOutline.execenv
    A test/gpu/native/streamPrototype/gpuOutline.good
    A test/gpu/native/streamPrototype/gpuOutline.prediff
    M compiler/AST/primitive.cpp
    M compiler/codegen/cg-expr.cpp
    M compiler/include/primitive_list.h
    M compiler/optimizations/deadCodeElimination.cpp
    M compiler/optimizations/optimizeOnClauses.cpp
    M runtime/include/chpl-gpu.h
    M runtime/src/chpl-gpu.c
    M test/gpu/native/gpuAddNums/gpuAddNums_primitive.chpl
    M test/gpu/native/threadBlockAndGridPrimitives.chpl

    Compare: Comparing 32dd29e6af43...0f62b7c40b69 · chapel-lang/chapel · GitHub