How to use the prototype GPU codegen feature in 1.24?

Hello!

This is Akihiro@GATech.

I'm very excited to see the prototype GPU codegen feature released in 1.24! Congratulations! I'd like to run the gpuAddNums.chpl example on one of my GPU machines, but it looks like the compiler does not generate a fatbin file. Any suggestions?

For more details, please see below:

  1. I'm trying to compile this program: gpuAddNums.chpl from the chapel-lang/chapel repository on GitHub (master branch)

  2. I used the compiler options listed in gpuAddNums.compopts.

  3. The only change I made was updating the compute capability option from sm_60 to sm_61. However, regardless of which value I use, the compiler gives me this warning:
    warning: argument unused during compilation: '--cuda-gpu-arch=sm_61'

  4. To build Chapel, I used the tar.gz release; here is the output of my printchplenv:
    CHPL_TARGET_PLATFORM: linux64
    CHPL_TARGET_COMPILER: gnu
    CHPL_TARGET_ARCH: x86_64
    CHPL_TARGET_CPU: native
    CHPL_LOCALE_MODEL: flat
    CHPL_COMM: none
    CHPL_TASKS: qthreads
    CHPL_LAUNCHER: none
    CHPL_TIMERS: generic
    CHPL_UNWIND: none
    CHPL_MEM: jemalloc
    CHPL_ATOMICS: cstdlib
    CHPL_GMP: bundled
    CHPL_HWLOC: bundled
    CHPL_REGEXP: re2
    CHPL_LLVM: bundled
    CHPL_AUX_FILESYS: none

  5. It looks like the NVPTX backend is enabled in the bundled LLVM:
    $CHPL_HOME/third-party/llvm/install/linux64-x86_64-gnu/bin/llc --version
    LLVM (http://llvm.org/):
    LLVM version 11.0.1
    Optimized build.
    Default target: x86_64-unknown-linux-gnu
    Host CPU: skylake

    Registered Targets:
      aarch64 - AArch64 (little endian)
      aarch64_32 - AArch64 (little endian ILP32)
      aarch64_be - AArch64 (big endian)
      arm64 - ARM64 (little endian)
      arm64_32 - ARM64 (little endian ILP32)
      nvptx - NVIDIA PTX 32-bit
      nvptx64 - NVIDIA PTX 64-bit
      x86 - 32-bit X86: Pentium-Pro and above
      x86-64 - 64-bit X86: EM64T and AMD64

Please let me know if you need more information!

Thanks,

Akihiro

Hi Akihiro,

Thanks for trying out our GPU codegen feature! Could you try setting the environment variable CHPL_LOCALE_MODEL to gpu instead of flat? Let me know if the fatbin file is still not being generated.
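
In case it's useful, the full sequence would look roughly like this (a sketch; it assumes a source build under $CHPL_HOME and reuses the flags you've been using, since the runtime needs to be rebuilt after changing the locale model):

```shell
# Switch locale models and rebuild the runtime for the new setting.
export CHPL_LOCALE_MODEL=gpu
cd $CHPL_HOME
make

# Then recompile the example with the same flags as before.
chpl --llvm --ccflags --cuda-gpu-arch=sm_61 \
     gpuAddNums.chpl -L/usr/local/cuda/lib64
```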

Thanks,
Sarah

Hi Sarah,

Thank you for your prompt reply. With CHPL_LOCALE_MODEL=gpu, I was able to generate the fatbin file and run it on my GPU!

Just checking, is gpuAddNums.chpl the only GPU example so far?

Thanks,

Akihiro

Great! Yes, that's the only GPU example we have so far. We're currently working on more complex GPU examples and plan to release them in the near future.

Sarah

Great, thank you!

Best regards,

Akihiro

Hi Sarah,

Sorry, I've run into another problem when linking a Chapel module that has an external C function with CHPL_LOCALE_MODEL=gpu. Could you take a look when you get a chance?

I created a simple program (w/o the GPU code generation) that reproduces the problem:

// CPrint.chpl
module CPrint {
    extern proc printHello();
}
// test.chpl
use CPrint;
printHello();
// test.h
void printHello();
// test.c
#include <stdio.h>
void printHello() { printf("Hello from C\n"); }

With CHPL_LOCALE_MODEL=flat, I was able to compile and run it:

$ CHPL_LOCALE_MODEL=flat chpl --llvm --ccflags --cuda-gpu-arch=sm_60 --savec=tmp test.h test.o test.chpl -L/usr/local/cuda/lib64
warning: argument unused during compilation: '--cuda-gpu-arch=sm_60'
$ ./test
Hello from C

However, with CHPL_LOCALE_MODEL=gpu, I got a linker error saying that the external C function is not found:

CHPL_LOCALE_MODEL=gpu chpl --llvm --ccflags --cuda-gpu-arch=sm_60 --savec=tmp test.h test.o test.chpl -L/usr/local/cuda/lib64
warning: Unknown CUDA version. version.txt: 10.2.89. Assuming the latest supported version 10.1
warning: Unknown CUDA version. version.txt: 10.2.89. Assuming the latest supported version 10.1
tmp/chpl__module.o: In function `chpl__init_test':
root:(.text+0x128d10): undefined reference to `printHello()'
clang-11: error: linker command failed with exit code 1 (use -v to see invocation)
error: Make Binary - Linking

I added --ldflags -v to inspect the linker command, but apparently 1) test.o, which should contain the body of printHello(), is being linked, and 2) there is no significant difference between the flat and gpu invocations (although -lnuma is added in the gpu version).

Please let me know if you have any suggestions.

Best regards,

Akihiro

Hi Akihiro,

I'm able to reproduce the issue and will work on fixing it. Thanks for bringing it up. In the meantime, if it helps, you should be able to run the following:

extern {
  #include <stdio.h>
  void printHello() { printf("Hello from C\n"); }
}
printHello();

with both CHPL_LOCALE_MODEL=flat and CHPL_LOCALE_MODEL=gpu.

Hi Sarah,

Thank you very much for reproducing the issue and working on a fix! I'll use the workaround for now. Thanks again!

Best regards,

Akihiro

Hi Akihiro,

I'm still working on a fix, but I have another workaround: if you modify your test.h file to

// test.h
#ifdef __cplusplus
extern "C" {
#endif

void printHello();

#ifdef __cplusplus
}
#endif

you should be able to compile test.chpl with CHPL_LOCALE_MODEL=gpu (i.e. CHPL_LOCALE_MODEL=gpu chpl --llvm --ccflags --cuda-gpu-arch=sm_60 --savec=tmp test.h test.o test.chpl -L/usr/local/cuda/lib64)

Hi Sarah,

Thank you very much! It works! I think that's the solution I was looking for.

Best regards,

Akihiro

Hi Akihiro,

I recently implemented a fix for this issue on Chapel's master branch. You should now be able to compile test.chpl with CHPL_LOCALE_MODEL=gpu without using a workaround. Let me know if you run into any problems.

Thanks,
Sarah

Hi Sarah,

That's great to hear! Thank you very much.

Best regards,

Akihiro