Since GPU support doesn't yet handle 2D/3D kernels and I'm porting a 3D one, I've linearized my index space into a single forall loop over ~5.5 billion GPU threads / unique linear IDs. I then de-linearize each ID back into CUDA-style threadIdx/blockIdx variables so that I can keep mapping one axis of the problem onto each dimension.
var workSize : int(64) = (isXBlock*isXGrid) * (isYBlock*isYGrid) * (isZBlock*isZGrid);
forall linear_id in 0..<workSize {
  var tid = get_ND_ID((isXGrid, isYGrid, isZGrid), (isXBlock, isYBlock, isZBlock), linear_id);
  ...
}
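For reference, the de-linearization is along these lines (a simplified sketch, assuming x varies fastest within a block and blocks are laid out the same way across the grid; the real helper may order things differently):

// Sketch: recover blockIdx/threadIdx-style coordinates from a flat linear ID.
proc get_ND_ID(grid: 3*int, blk: 3*int, linearId: int(64)) {
  const (gridX, gridY, gridZ) = grid;
  const (blkX, blkY, blkZ) = blk;
  const threadsPerBlock = blkX * blkY * blkZ;

  // Split the flat ID into a block number and an offset within that block.
  const blockNum  = linearId / threadsPerBlock;
  const inBlockId = linearId % threadsPerBlock;

  // threadIdx-style coordinates within the block (x fastest).
  const tx = inBlockId % blkX;
  const ty = (inBlockId / blkX) % blkY;
  const tz = inBlockId / (blkX * blkY);

  // blockIdx-style coordinates within the grid.
  const bx = blockNum % gridX;
  const by = (blockNum / gridX) % gridY;
  const bz = blockNum / (gridX * gridY);

  return ((tx, ty, tz), (bx, by, bz));
}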
According to the CUDA SDK's deviceQuery, my card (an RTX 3090) can support way more than 5.5B threads:
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
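(Quick sanity check on that claim: even using only the x grid dimension with 1024-thread blocks, that's 2,147,483,647 × 1024 ≈ 2.2e12 threads per launch, far beyond the ~5.5e9 I need.)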
But according to Nsight Compute, I'm only actually enqueuing ~1.2B threads, which leads to incomplete data and, in turn, lots of NaNs from a subsequent kernel.
A CPU version of the same kernel runs the correct number of threads and produces correct results.
Oddly, when I look at my next-largest problem size (the next one down in terms of ID space, not work-per-thread), it successfully enqueues more threads (~1.6B) than the bigger case did, enough to cover its assigned problem size of 1.6B.
In terms of moving forward: where/how would I look at the generated intermediate CUDA code? And/or how does it decide how many threads of the forall to turn into a grid/block configuration?