[Chapel Merge] Add STREAM for GPU and make necessary adjustments

Branch: refs/heads/main
Revision: 90560c8
Author: e-kayrakli
Log Message:

Merge pull request #18321 from e-kayrakli/gpu-stream

Add STREAM for GPU and make necessary adjustments for it

This PR adds a full STREAM benchmark for GPU.

In order to achieve that, it also makes the following adjustments to the GPU
support:

  • Adds proper grid size calculation based on loop "size" and block size.
    • This is currently done by passing an argument representing the number of
      threads to the runtime's kernel launcher. From there, the runtime launcher
      does the computation.
    • So, this PR adjusts the runtime interface slightly, as well.
  • Adds an early return check in the GPU kernel, in case the local thread index
    is out-of-bounds for the loop.
  • Adds a --gpu-block-size compiler flag to control the block size of GPU
    kernels.
  • Adjusts the denormalize pass to avoid replacing temps used for kernel launches
    with equivalent expressions. This is done to avoid making more significant
    adjustments in the kernel launch codegen.
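
The grid size calculation and the early return check described above can be
sketched on the CPU as follows. This is only an illustration under stated
assumptions, not the code in runtime/src/chpl-gpu.c: the names grid_size_for and
simulate_launch are hypothetical, and the nested loops stand in for the blocks
and threads that a real launch would run in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the Chapel runtime's actual function): the number
   of blocks needed so that grid * block_size covers num_threads. */
static size_t grid_size_for(size_t num_threads, size_t block_size) {
  assert(block_size > 0);
  /* Ceiling division, written to avoid overflow in the addition form. */
  return num_threads / block_size + (num_threads % block_size != 0);
}

/* CPU stand-in for a kernel launch over a loop of n iterations.
   Returns how many simulated threads did real work. */
static size_t simulate_launch(int *data, size_t n, size_t block_size) {
  size_t grid = grid_size_for(n, block_size);
  size_t active = 0;
  for (size_t block = 0; block < grid; block++) {
    for (size_t thread = 0; thread < block_size; thread++) {
      size_t idx = block * block_size + thread;
      /* Early return check: the last block may have threads past the end
         of the loop, and those must do nothing. */
      if (idx >= n) continue;
      data[idx] += 1;
      active++;
    }
  }
  return active;
}
```

With 10 loop iterations and a block size of 4, the sketch launches 3 blocks
(12 simulated threads), and the bounds check keeps the last 2 threads idle.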

While there:

  • Moves the "Kernel launcher called" output earlier to catch fatal launch
    errors that are caused by failing to load a kernel from the fatbinary.
  • Drops an unused variable from a test.

[Reviewed by @daviditen]

Test

  • [x] test/gpu/native

  • [x] standard

Modified Files:
A test/gpu/native/streamPrototype/stream.chpl
A test/gpu/native/streamPrototype/stream.compopts
A test/gpu/native/streamPrototype/stream.execopts
A test/gpu/native/streamPrototype/stream.good
M compiler/include/driver.h
M compiler/main/driver.cpp
M compiler/optimizations/deadCodeElimination.cpp
M compiler/passes/denormalize.cpp
M runtime/include/chpl-gpu.h
M runtime/src/chpl-gpu.c
M test/gpu/native/streamPrototype/forallOverZipArray.chpl
M util/chpl-completion.bash

Compare: 78db7d764a59...90560c82232a (chapel-lang/chapel)