Essentially, when using promoted array assignment on the GPU, the outcome doesn't match what currently happens on the CPU. From the above issue, it sounds like the CPU should be catching this as an error, but since compiling for GPU forces --no-checks, the CPU storms forward with behavior -- assigning the matching elements and zeroing out the unshared elements -- that is undefined according to the spec (Arrays — Chapel Documentation 1.30) and the primer (Arrays — Chapel Documentation 1.30).
The GPU, on the other hand, is 1) assigning the matching elements, 2) zero-/garbage-/NaN-assigning some middle elements near (just past?) the end of small.domain, and 3) NOOPing on most of the unshared elements.
I've attached a small reproducer. I'm running a build compiled from GitHub SHA: d7664c9d81b1cb3d84. The behavior does not appear to differ whether I'm using a compiler/runtime built with CHPL_GPU_MEM_STRATEGY=unified_memory or array_on_device.
My naive expectation before re-reading the Arrays spec was that it shouldn't even be an error: it should assign only the members where the domains have overlapping elements and NOOP everywhere else. (Which is what happens if you small = large[small.domain]; on either the CPU or GPU.)
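For concreteness, here's a minimal sketch of the two assignment forms being discussed (the array names and sizes are illustrative, not taken from the attached reproducer):

```chapel
config const n = 10;
var large: [0..<n] int;
var small: [0..<4] int = [1, 2, 3, 4];

// Mismatched whole-array assignment: erroneous per the spec, and only
// caught when execution-time checks are enabled (not under --no-checks).
// large = small;

// Explicit slice assignment: well-defined, touches only the overlapping
// indices and leaves the rest of `large` alone.
large[small.domain] = small;
writeln(large);  // 1 2 3 4 0 0 0 0 0 0
```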
If a subdomain isn't given, I don't particularly care whether the missing elements are (ordered by preference) 1) NOOPed, 2) force a compiler and/or runtime error, 3) zero-assigned, as long as the behavior is consistent between CPU and GPU. As documented, I guess (2) would be the target behavior? smallToBigPromotedGPUAssign.chpl (2.0 KB)
Sorry you're running into this. As you've seen, in the absence of runtime checks, what occurs when assigning between arrays of different sizes is "undefined behavior". I agree this is confusing, and it's not unreasonable to assume that in such cases non-overlapping elements would be zero-assigned or left untouched on the left-hand side. This makes having the check enabled by default all the more important.
I also agree that, albeit undefined, it's somewhat surprising that this behavior differs between CPUs and GPUs. I'll spend some time investigating why this is but, to be candid, unless there's some obvious "fix" or some really convincing argument for consistency, it's unlikely that we'd concern ourselves too much with it. When --no-checks is in place, our goal is to err on the side of high performance, and it may be that ensuring consistency would add some overhead we'd rather avoid. That said, I haven't looked deeply into this yet, so I'll investigate and see if there is something we could do.
All that said, ultimately our goal is to do what you had as option (2): force an error by default. As an FYI: the reason we currently force --no-checks with GPU support enabled is that we currently don't have a way to halt computation from within a GPU kernel and our "admittedly more of an axe than a scalpel" kind of solution to this was to disable checks altogether. Again, apologies that you're running into issues with this, it gives us all the more motivation to lift this "forced --no-checks" issue sooner rather than later.
Catching up with this thread late in the day, and tagging onto Andy's response a bit...
It is our intention in Chapel that assignments between arrays of different shapes/sizes (or, nearly equivalently, promotions of scalar routines by arrays of different shapes/sizes) be erroneous behavior—and ideally behavior that's caught when execution-time checks are enabled. So large=small; should be an error / shouldn't be expected to work (and Andy explains why, for GPUs, we don't do so today).
If there was a specific aspect of the arrays chapter of the spec that led you to believe that it should be permitted, that would be good feedback for us to hear.
Ultimately, my misunderstanding of the spec (and glossing over where it specifically says you can't do this) came from the --no-checks flag. If I'd started on CPU Chapel, it would've caught the error at execution time. Since it didn't, I just figured it was one of Chapel's many nifty intentional shorthands and ran with the bad idea. The subdomain syntax to explicitly do it the right way is still clearer than computing memcpy offsets, so either way it's a net win.
As far as lifting --no-checks goes, unfortunately I don't have much to suggest. My thoughts lead towards implementing some form of bespoke signal handler that sits on top of CUDA, ROCm, etc. with GPU callbacks, but that would probably require flushing the GPU queue after every operation to ensure the relative timeliness of halt. Losing the asynchrony would probably come with a performance hit, which I know you guys are always trying to avoid. Tough puzzle!
My personal thoughts on this puzzle are that we will have to forgo timeliness of the halt. Halting from inside a kernel will turn into a message being printed to the GPU's IO buffer and flipping a bit that the CPU can read after synchronizing the stream/context. At that point the CPU will say "Ha! There was a halt in the GPU kernel. Let me exit()". This may be a bit naive, as I am hoping that whatever happens between the event that triggered the "halt" in the GPU kernel and the context/stream synchronization will not burn the house down. But as "halt" is already a severe enough end to an application, this should be acceptable.
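As a rough illustration of that "flag now, halt after sync" idea, here's a sketch written as user-level Chapel (everything here is hypothetical; a real implementation would live inside the runtime, and the flag array is just a stand-in for a device-visible bit):

```chapel
// Hypothetical sketch of a delayed halt: the kernel records the error
// instead of halting, and the host checks the flag after the kernel
// completes (i.e., after the stream/context has synchronized).
var errorFlag: [0..0] int;            // stand-in for a device-visible "halt happened" bit

on here.gpus[0] {
  var devFlag = errorFlag;            // copy the flag to the GPU sublocale
  forall i in 0..<1000 {
    // ... kernel body; on a would-be halt, set the flag instead:
    if i == 42 then devFlag[0] = 1;   // stand-in for a detected bounds violation
  }
  errorFlag = devFlag;                // copy back after the kernel finishes
}

// Back on the host, after synchronization, turn the flag into a real halt:
if errorFlag[0] != 0 then
  halt("a GPU kernel reported an error after the fact");
```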
Related question (and not really a solution, more a curiosity):
Engin, if a bounds-checking error occurs in your new "use the CPUs to simulate GPUs" mode, will that halt(), or will it inherit the GPU's --no-checks behavior? Essentially, I'm wondering whether, if someone were to debug their code in that mode, they'd catch and fix the error there before recompiling for GPUs.
That said, again, I don't really consider this a solution (i.e., I don't think we should require/expect users to do this to catch bugs if it did) and prefer your "delayed halt" proposal.
That was one of the things I had attempted in an earlier version of my branch. However, it was bumping into an issue that we also had on "proper" GPU builds. I am blanking on what that was, though. Currently, that mode also implies --no-checks.
While reviewing the PR, Andy and I had also discussed a bit whether we want CHPL_GPU=cpu to be able to run with checks enabled (better debugging support) or not (closer behavior to CHPL_GPU!=cpu). I can't recall where we stood at the end of that discussion. I am currently for the former, if those were the only two options.
We have had a lot of changes in the runtime since I started working on CHPL_GPU=cpu, and whatever that issue was, it could have already been resolved. It might be worth giving it a try, but I wouldn't consider that something any of us should put serious time into when we can work on the simple halt idea above. Which I suppose is where you stand, as well.
Delayed halt is pretty par for the course for GPUs; we're used to not being able to check for error codes until we explicitly synchronize the device. So as long as you guys are ok with halt not being timely because it's already a severe situation, then I think this approach makes sense... Essentially just inject some sanity-check callbacks between GPU operations, or something similar that would allow the streams/queues to run fully asynchronously as intended.
As an aside, this "simulate GPU on the CPU" mode sounds useful from a developer perspective. I vaguely recall resolving some device-side out-of-bounds accesses that way back when CUDA still supported -deviceemu, and/or test-running OpenCL on the CPU before moving to the GPU.
Unfortunately, our docs from the main development branch aren't uploaded to our webhost at the moment because of some infrastructure issues we are wrestling with. But chapel/gpu.rst at main · chapel-lang/chapel · GitHub is the documentation we have for that mode.