I understand that GPUization doesn't yet support atomics. But here is a pattern from access-qualified GPU code that we would like to (eventually) GPU-ize via Chapel, and that we've tried to model using Chapel's existing CPU-side type-qualified atomics.
CUDA sketch: `producer` does atomic writes, but `consumer` performs safe non-atomic reads.
```cpp
// Atomically produce
__global__ void producer(int *runtimeOffsets, float *atomBuf) {
  ...
  float myIncrement = ...;
  atomicAdd(&atomBuf[someFunctionOf(runtimeOffsets[threadIdx.x])], myIncrement);
  ...
}

// Non-atomically consume
__global__ void consumer(float *atomBuf, float *outBuf) {
  ...
  outBuf[threadIdx.x] = someOtherFunctionOf(atomBuf[threadIdx.x]);
  ...
}
```
Neither `someFunctionOf` nor `someOtherFunctionOf` actually makes any calls, just some global memory reads, arithmetic, and conditionals, but `someFunctionOf` may result in an index overlap between threads. In practice, `producer` is 3D (`forall gridDim.z { forall gridDim.y { forall gridDim.x { atomicStuff } } }`) and only contends in the X dimension, whereas `consumer` is 1D. (3D/2D kernels are a separate GPU wish-list item.)
Here's a sketch of what I've got in Chapel, which makes the consumer perform an atomic read since the entire `atomBuf` array is typed with `atomic`.
```chapel
on here.gpus[0] {
  var problemSize: int(64);
  ...
  var runtimeOffsets: [0..#problemSize] cpuOffsets.eltType;
  var atomBuf: [0..#problemSize] atomic cpuOutBuf.eltType;
  var outBuf: [0..#problemSize] cpuOutBuf.eltType;

  // Atomically produce
  forall x in runtimeOffsets.domain {
    var myIncrement: real(32) = ...;
    ...
    atomBuf[someFunctionOf(runtimeOffsets[x])].add(myIncrement);
  }
  ...
  // Non-atomically consume (ideally)
  forall x in runtimeOffsets.domain {
    ...
    outBuf[x] = someOtherFunctionOf(atomBuf[x].read());
    ...
  }
  ...
}
```
Is there a way (through casts in `producer`, etc.) that I can instead get the CUDA-like access-qualifier semantics? Something like:
```chapel
...
var atomBuf: [0..#problemSize] cpuOutBuf.eltType;
// Atomically produce
forall x in runtimeOffsets.domain {
  ...
  ref castAddr = (atomBuf[someFunctionOf(runtimeOffsets[x])] : atomic atomBuf.eltType);
  castAddr.add(myIncrement);
  ...
}
```
I'm not sure whether the above would correctly capture an alias to the address in `atomBuf`, or instead a stack temporary holding the pre-cast portion of the RHS. (Essentially, without C pointers all the way through, I'm not sure when Chapel's `ref` initializer starts using pointer semantics on the RHS.)
If it does capture the address in `atomBuf`, then the above would be similar to how we achieve this pattern in SYCL, which is fine:

```cpp
...
cl::sycl::atomic<float> cast_addr{cl::sycl::global_ptr<float>{&atomBuf[someFunctionOf(runtimeOffsets[x])]}};
cast_addr.fetch_add(myIncrement);
...
```
Alternatively, could we "cast away" the `atomic` type-qualifier in `consumer` to just make an "unsafe" (but in practice fine) read from `atomBuf`'s global memory?
In the short term, I'd also be open to just making some kind of `extern` call to CUDA's existing `atomicAdd` (and whatever the AMD HIP equivalent is), if it would allow GPUization of the rest of the kernel(s) in Chapel.