Mixing atomic add with non-atomic read? (and towards GPU atomics)

I understand that GPUization doesn't yet support atomics. But here is a pattern, from access-qualified GPU code that we would like to (eventually) GPU-ize via Chapel, that we've tried to model using Chapel's existing CPU-side type-qualified atomics.

Here's a CUDA sketch: producer performs atomic writes, while consumer performs safe non-atomic reads.

// Atomically produce
__global__ void producer(int *runtimeOffsets, float *atomBuf) {
  ...
  float myIncrement = ...;
  atomicAdd(&atomBuf[someFunctionOf(runtimeOffsets[threadIdx.x])], myIncrement);
  ...
}

// Non-atomically consume
__global__ void consumer(float *atomBuf, float *outBuf) {
  ...
  outBuf[threadIdx.x] = someOtherFunctionOf(atomBuf[threadIdx.x]);
  ...
}

Neither someFunctionOf nor someOtherFunctionOf actually makes any calls, just some global memory reads, arithmetic, and conditionals, but someFunctionOf may produce index overlaps between threads. In practice, producer is 3D (forall gridDim.z { forall gridDim.y { forall gridDim.x { atomicStuff } } }) and only contends in the X dimension, whereas consumer is 1D. 3D/2D kernels are a separate GPU wish-list item :smile:

Here's a sketch of what I've got in Chapel, which makes the consumer perform an atomic read, since the entire atomBuf array is typed as atomic.

on here.gpus[0] {
  var problemSize : int(64);
  ...
  var runtimeOffsets: [0..<problemSize] cpuOffsets.eltType;
  var atomBuf: [0..<problemSize] atomic cpuOutBuf.eltType;
  var outBuf: [0..<problemSize] cpuOutBuf.eltType;
  // Atomically produce
  forall x in runtimeOffsets.domain {
    var myIncrement: real(32) = ...;
    ...
    atomBuf[someFunctionOf(runtimeOffsets[x])].add(myIncrement);
  }
  ...
  // Non-atomically consume (ideally)
  forall x in runtimeOffsets.domain {
    ...
    outBuf[x] = someOtherFunctionOf(atomBuf[x].read());
    ...
  }
  ...
}

Is there a way (through casts in producer, etc.) that I can instead get the CUDA-like access-qualifier semantics? Something like:

  ...
  atomBuf: [0..<problemSize] cpuOutBuf.eltType;
  // Atomically produce
  forall x in runtimeOffsets.domain {
    ...
    ref castAddr = (atomBuf[someFunctionOf(runtimeOffsets[x])] : atomic atomBuf.eltType);
    castAddr.add(myIncrement);
    ...
  }

I'm not sure whether the above would correctly capture an alias to the address in atomBuf, or instead capture a stack temporary made from the pre-cast portion of the RHS. (Essentially, without C pointers all the way through, I'm not sure at what point Chapel's ref initialization starts using pointer semantics on the RHS.)

If it does capture the address in atomBuf, then the above would be similar to how we achieve this pattern in SYCL (which is fine):

  ...
  cl::sycl::atomic<float> cast_addr{cl::sycl::global_ptr<float>{&atomBuf[someFunctionOf(runtimeOffsets[x])]}};
  cast_addr.fetch_add(myIncrement);
  ...

Alternatively, could we "cast away" the atomic type-qualifier in consumer to just make an "unsafe" (but really it's fine) read from atomBuf's global memory?

In the short term, I'd also be open to just making some kind of extern call to CUDA's existing atomicAdd (and whatever the AMD HIP equivalent is), if it would allow GPUization of the rest of the kernel(s) in Chapel.

Thanks for the detailed description of your use case, Paul. As for the general request of supporting atomic operations on non-atomic types and vice versa, this has been discussed in the past. Elliot knows the most about the matter, and he'll capture his thoughts in an issue soon. But in the short term, your suggestion of providing wrappers is something we can definitely do. I created Provide atomic operation wrappers to be used in GPU kernels · Issue #22155 · chapel-lang/chapel · GitHub to track that request.

For other points:

In practice, producer is 3D (forall gridDim.z { forall gridDim.y { forall gridDim.x { atomicStuff } } } ) and only contends in the X dimension, whereas consumer is 1D. 3D/2D kernels are a separate GPU wish-list item :smile:

This is something that came up for us in the past, too, but we haven't spent much time on it. I've captured our previous recorded discussions and the general feature request under Support multidimensional kernel launches · Issue #22152 · chapel-lang/chapel · GitHub. If you have further comments, please share them there.

Is there a way (through casts in producer, etc.) that I can instead get the CUDA-like access-qualifier semantics?

None that I am aware of, I am afraid.

I'm not sure whether the above would correctly capture an alias to the address in atomBuf, or instead capture a stack temporary made from the pre-cast portion of the RHS. (Essentially, without C pointers all the way through, I'm not sure at what point Chapel's ref initialization starts using pointer semantics on the RHS.)

It looks like you can't bind the result of a cast to a ref today; the cast produces a const. So you'd need to bind it to a const ref, which precludes calling add anyway.

Hi Paul,

In the short term, I'd also be open to just making some kind of extern call to CUDA's existing atomicAdd (and whatever the AMD HIP equivalent is), if it would allow GPUization of the rest of the kernel(s) in Chapel.

I would expect that we can support that, although calling extern functions from within a GPUizable loop is a bit clunky today. To do this:

  • define your CUDA function in an extern block and make it __device__ static inline
  • create an equivalent host version of the function
  • declare an extern proc for the function in Chapel
  • add pragma "codegen for GPU" above your declaration

You need the host function even if your loop will never run on the CPU because, as of today, Chapel generates code to execute both on the CPU and on the GPU, and does a runtime check to see whether you're currently on a GPU locale (launching the kernel if that's the case).

We have a Github issue about making this more elegant: Calling extern functions from GPU kernels · Issue #21906 · chapel-lang/chapel · GitHub

As for calling out to CUDA's atomicAdd this way, I tried the following and it appears to work:

use GPU, CTypes;

extern {
  #include <cuda.h>
  #include <cuda_runtime.h>
  #include <cuda_runtime_api.h>

  __device__ static inline void atomicIncrement(float *atomBuf, int idx, float myIncrement) {
    atomicAdd(&atomBuf[idx], myIncrement);
  }

  __host__ static inline void atomicIncrement(float *atomBuf, int idx, float myIncrement) {}
}

config param N = 5;

proc main() {
  pragma "codegen for GPU"
  extern proc atomicIncrement(atomBuf : c_ptr(c_float), idx : c_int, myIncrement : c_float);

  on here.gpus[0] {
    var A : [0..<N] real(32);
    var cPtrToA = c_ptrTo(A);

    foreach i in 0..<N {
      assertOnGpu();
      atomicIncrement(cPtrToA, 0, 1.0);
    }
    writeln(A);
  }
}

I also ran into an issue with using c_ptrTo in the loop. I worked around it by declaring cPtrToA outside the loop and I've created a separate GitHub issue to track that: https://github.com/chapel-lang/chapel/issues/22151.

Thanks for the workaround, this doesn't look too bad at all to use in the interim. Will give it a shot!

Hi Paul,

As an FYI, I recently submitted a PR that adds a number of gpuAtomic functions: https://github.com/chapel-lang/chapel/pull/22241. These functions call out to the equivalent functions in CUDA or HIP to perform the atomic op.

These are currently undocumented and long-term aren't something we'd intend users to call directly (I imagine they'd be called internally if the users used atomic variables within a kernel). However, in the meantime you might find it more convenient to call out to these than to an extern proc.

If you don't think the operations I added in that PR work for your use case let me know and I can see if I can add one that would.

Thanks,
-Andy

Thanks for the ping @stonea! The externs are working nicely for me, but I will keep these in mind to port to when it comes time to test the AMD backend. (Hopefully 7900 XTX support comes in ROCm 5.6, since 5.5 came and went without it :sob: )

One of our students is going to need to port a Thrust reduction-max soon... He might be interested.