Problem building Chapel on WSL Ubuntu with GPU

Hi Brad,
Could you please give me your advice? I am trying to see whether a GPU can
boost the performance of the typical numerical pieces we have. See the
attached file, where I have put together a small piece representing what we
do; it has a fairly typical proportion of data movement to calculation. I
tried slicing the data, but it does not bring any improvement; on the
contrary, it slows things down. Am I doing it right? I hope it won't take you
more than half a minute to look at the code.

Does this mean that this kind of calculation should stay on CPUs, or is it
worth trying a better GPU (hardware, since data movement between the CPU and
the GPU is essential)?

Thanks a lot in advance !!!

Igor

(attachments)

dragg.chpl (2.63 KB)

Hi @iarshavsky

I think it would be good to create a new topic for this discussion so that it doesn't get lost in this one about build issues, which people may have muted if that wasn't of interest to them.

I don't have a lot of time to look at this at the moment, but have a few questions about the code:

  • Are you using slicing because there's not enough memory on the GPUs to store all of the array data at once?

  • What values of n and sn are you running with in practice?

  • The first = true case looks like it is incomplete as currently written, is that correct? (It copies the arrays but then does nothing with them?)

I'm not nearly as experienced with GPU programming as others on the team, but the main thing that catches my eye is the use of a coforall within the on block that targets the GPU. Generally, I believe you want GPU computations to use data parallel computations that can scale to arbitrary numbers of cores rather than trying to create explicit tasks for them. As an example, most of the array statements within the body of the coforall are data parallel and (I believe) should make good use of the GPU resources (for a large enough slice) without the need for additional explicit parallel tasks as introduced by the coforall.

For that reason, I'd suggest either:

  • changing the coforall to a for and maybe increasing your slice size if you were trying to express things on a core-by-core basis
  • if you have multiple GPUs, moving the coforall outside of the on-clause and using it to deal different slices of the data set across multiple GPUs in parallel
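
As a rough sketch of the first option (this is not the code from dragg.chpl: the arrays A and B, the arithmetic, and the fallback to the CPU when no GPU is present are all assumptions made for illustration), the slices are visited by a serial for loop and the array statements inside it supply the parallelism:

```chapel
// Sketch only: A, B, and the arithmetic are hypothetical stand-ins
// for the data and calculations in the attached program.
config const n = 20000,   // total number of elements
             sn = 5000;   // slice size

var A, B: [0..#n] real;   // host-side input and output
A = 1.0;

// Assumption: use the first GPU if there is one, otherwise stay on the CPU.
const target = if here.gpus.size > 0 then here.gpus[0] else here;

on target {
  // Serial loop over slices: no explicit tasks are created here.
  for start in 0..#n by sn {
    const stop = min(start + sn, n) - 1;

    var a = A[start..stop];   // copy one slice to the target locale
    var b = 2.0 * a + 1.0;    // promoted array expression (data parallel)
    B[start..stop] = b;       // copy the slice of results back
  }
}

writeln("B[0] = ", B[0], ", B[n-1] = ", B[n-1]);
```

With several GPUs, the second option would follow the same shape, with a coforall over here.gpus placed around the on-clause so that each GPU handles its own set of slices.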

Those are my quick reactions, though I'm sure there's a lot about this computation (and GPU programming in Chapel) that I'm not familiar with. Hopefully others can help out as well (where, again, a new topic may help with that).

-Brad


@iarshavsky : This post that @e-kayrakli recently made on a sibling topic may also be useful if you haven't seen it: gpuClock() only returns 0 - #13 by e-kayrakli

-Brad

Thank you, Brad,
Quick answers to your questions:

  1. The first flag is used to move the constant data up front, and I
    exclude this step from the timing.
  2. The coforall is an attempt to overlap data transfer and calculations.
  3. n=20000, sn=5000; if I make them 10 times larger, the GPU becomes
    slightly faster than the CPU (about the same, actually).
  4. I have 1 GPU and 6 cores.
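
To illustrate item 1, here is a minimal sketch of timing only the repeated work while leaving the one-time copy of constant data outside the measurement. The arrays C and R, the squaring step, and the CPU fallback are assumptions for illustration and are not taken from dragg.chpl:

```chapel
use Time;

config const n = 20000;

var C: [0..#n] real = 0.5;  // hypothetical constant input data
var R: [0..#n] real;        // results, kept on the host
var elapsed: real;          // seconds spent in the timed section

// Assumption: use the first GPU if there is one, otherwise stay on the CPU.
const target = if here.gpus.size > 0 then here.gpus[0] else here;

on target {
  var c = C;                // one-time copy of the constant data, not timed

  var t: stopwatch;
  t.start();
  var r = c * c;            // promoted computation on the target locale
  R = r;                    // copy of the results back to the host, timed
  t.stop();

  elapsed = t.elapsed();
}

writeln("timed section: ", elapsed, " s");
```

Here the copy of the results back to the host is inside the timed section, so the measurement includes that part of the data movement but not the initial transfer.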

Thanks again,
Igor

Hi Igor —

I'm cleaning up my inbox and realized that I never replied to this, though I think that you forked this question off to CHAPEL performance test on WSL ubuntu and that it was largely resolved satisfactorily there, is that right? If not and there's something else that needs to be handled on this thread, please let us know.

-Brad

Correct. Thank you so much !
