I understand that GPUization doesn't yet support atomics. But,
Here is a pattern from access-qualified GPU code that we would like to (eventually) GPU-ize via Chapel, that we've tried to model using Chapel's existing CPU-side type-qualified atomics.
CUDA sketch. producer does atomic writes, but consumer performs safe non-atomic reads.
//Atomically produce
__global void producer(int * runtimeOffsets, float * atomBuf) {
...
float myIncrement = ...;
atomicAdd(&atomBuf[someFunctionOf(runtimeOffsets[threadIdx.x])], myIncrement);
...
}
//Not-atomically consume
__global void consumer(float * atomBuf, float *outBuf) {
...
outBuf[threadIdx.x] = someOtherFunctionOf(atomBuf[threadIdx.x]);
...
}
Neither someFunctionOf or someOtherFunctionOf actually make any calls, just some global memory reads, arithmetic, and conditionals, but someFunctionOf may result in an index overlap between threads. In practice, producer is 3D (forall gridDim.z { forall gridDim.y { forall gridDim.x { atomicStuff } } }) and only contends in the X dimension, whereas consumer is 1D. 3D/2D kernels are a separate GPU wish-list item ![]()
Here's a sketch of what I've got in Chapel, which makes the consumer perform an atomic read since the entire atomBuf array is typed with atomic.
on here.gpus[0] {
var problemSize : int(64);
...
var runtimeOffsets: [problemSize] cpuOffsets.eltType;
var atomBuf: [problemSize] atomic cpuOutBuf.eltType;
var outBuf: [problemSize] cpuOutBuf.eltType;
//Atomically produce
forall x in runtimeOffsets.domain {
var myIncrement: real(32)= ...;
...
atomBuff[someFunctionOf(runtimeOffsets[x])].add(myIncrement);
}
...
//Non-atomically consume (ideally)
forall x in runtimeOffsets.domain {
...
outBuf[x] = someOtherFunctionOf(atomBuf[x].read());
...
}
...
}
Is there a way (through casts in producer, etc.) that I can instead get the CUDA-like access-qualifer semantics? Something like:
...
atomBuf: [problemSize] cpuOutBuf.eltType;
//Atomically produce
forall x in runtimeOffsets.domain {
...
ref castAddr = (atomBuf[someFunctionOf(runtimeOffsets[x])] : atomic atomBuf.eltType)
castAddr.add(myIncrement);
...
}
Not sure if the above would correctly capture the alias of an address in atomBuf, or rather the stack-temporary pre-cast portion of the RHS? (Essentially without C pointers all the way through, I'm not sure when Chapel's ref initializer starts using pointer semantics on the RHS?)
If it does capture the address in atomBuf, then the above would be similar to how we achieve this pattern in SYCL, which is fine):
...
cl::sycl::atomic<float> cast_addr{cl::sycl::global_ptr<float>{&atomBuf[someFunctionOf(runtimeOffsets[x])]}};
cast_addr.fetch_add(myIncrement);
...
Alternatively, could we "cast away" the atomic type-qualifier in consumer to just make an "unsafe" (but really it's fine) read from atomBuf's global memory?
In the short term, I'd also be open to just making some kind of extern call to CUDA's existing atomicAdd (and whatever the AMD HIP equivalent is), if it would allow GPUization of the rest of the kernel(s) in Chapel.