gpuClock() only returns 0

The program below,

// ===================================================================
// ==> mansum-g: manual distribution of the sum among available
// gpu cores. This program *overwrites* A[1],A[1+m],A[1+2m], ...
// ===================================================================
use Random;
use GPU;
config const Nn = 1_600_000;
on here.gpus[0] {
   const ng = Nn;
   var A: [1..ng] real;
   fillRandom(A,0);
   const npu = 3200;         // the number of GPU cores needs to be supplied
   assert( ng % npu == 0);   // make sure we can distribute the array
   var m = ng/npu;           // # of elements to be summed by each core
   var start = gpuClock();
   @assertOnGpu
   forall pu in 0..npu-1 do {
      var beg = pu*m + 1;
      var end = (pu+1)*m;
      for i in beg+1..end do {
         A[beg] += A[i];
      }
   }
   var end = gpuClock();
   writeln("start = ",start," end = ",end);
   var sum = 0.0;
   for pu in 0..npu-1 do {
      var beg = pu*m + 1;
      sum += A[beg];
   }
   writeln("sum = ",sum);
}

prints "start = 0, end = 0", when I expected it to time the bracketed forall loop. I would appreciate if somebody can explain why gpuClock() is not working.

Cheers

Nelson

Hi Nelson —

I don't have expertise here, but since it's the weekend, perhaps a quick response is better than a well-informed one (?).

Taking a quick look at the documentation for gpuClock, the phrase "within a GPU enabled loop" is jumping out at me. And checking for existing uses of it, I'm finding this test, which calls the routine within a foreach loop within a GPU on-clause, which is presumably running on the GPU.

So my guess would be that this routine may only return meaningful results when called within a GPU-enabled foreach or forall loop, and to that end, I would be curious whether you get different results if you move your calls to just within the forall:

   forall pu in 0..npu-1 do {
      var start = gpuClock();
      var beg = pu*m + 1;
      var end = (pu+1)*m;
      for i in beg+1..end do {
         A[beg] += A[i];
      }
      var stop = gpuClock();   // 'stop' rather than 'end', which is already used above
   }

Of course, that leads to the challenge of how to inspect the results since I believe (but am not 100% sure) that putting a writeln() within the forall would break its GPU eligibility. So my thought would be to declare an array outside of the loop with npu entries and to store the diffs, or the start/stop values, into that array and then print it outside the loop?

I expect someone more expert in this routine will be able to improve upon my guesswork here next week (I had a few loose ends to tie up this morning). If my guess is correct, it also makes me wonder whether there's more we could do to help make users aware of when they're using it outside of a GPU loop rather than just returning 0...

Hopefully this may help you move forward a bit,
-Brad

Thanks Brad! It worked. Here is the corrected version:

// ===================================================================
// ==> mansum-g: manual distribution of the sum among available
// gpu cores. This program *overwrites* A[1],A[1+m],A[1+2m], ...
// ===================================================================
use Random;
use GPU;
config const Nn = 1_600_000;
on here.gpus[0] {
   const ng = Nn;
   var A: [1..ng] real;
   fillRandom(A,0);
   const npu = 3200;         // the number of GPU cores needs to be supplied
   assert( ng % npu == 0);   // make sure we can distribute the array 
   var m = ng/npu;           // # of elements to be summed by each core
   var diff: [0..npu-1] uint;
   @assertOnGpu
   forall pu in 0..npu-1 do {
      var ist = gpuClock();
      var beg = pu*m + 1;
      var end = (pu+1)*m;
      for i in beg+1..end do {
         A[beg] += A[i];
      }
      var ien = gpuClock();
      diff[pu] = ien - ist;
   }
   var total = + reduce diff;
   writeln("total in seconds = ",(total:real)/(gpuClocksPerSec(0):real));
   var sum = 0.0;
   for pu in 0..npu-1 do {
      var beg = pu*m + 1;
      sum += A[beg];
   }
   writeln("sum = ",sum);
}

Now it prints

total in seconds = 1189.05
sum = 7.99828e+05

which suggests that I am not getting the units right (maybe a factor of 1.0e6), but that is another story. Many thanks!

Hi Nelson,

gpuClock is very tricky to use. We provide that method to have something similar to CUDA and HIP's clock() function on the device.

I think you've got most of the trickiness handled already, but + reduce diff looks wrong to me. gpuClock measures clock cycles spent on each parallel processor on the GPU. So, the measurements you are getting in diff are actually happening simultaneously. They are not additive. I think you should divide total by npu to get what you are looking for.
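
In other words, something along these lines in place of the + reduce (reusing the diff array and npu from your program above):

   // average, rather than sum, the per-thread cycle counts: the threads run
   // concurrently, so their cycle counts overlap in time
   var avgCycles = (+ reduce diff):real / npu;
   writeln("approx. kernel seconds = ", avgCycles / gpuClocksPerSec(0));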

Meanwhile, it looks like you are measuring the performance of the loop in full. So, why not measure the time spent for the full loop instead? Namely, do what you were doing in your first post, but instead of gpuClock, use stopwatch?

   // (requires 'use Time;' at the top of the program)
   var s: stopwatch;
   s.start();
   @assertOnGpu
   forall pu in 0..npu-1 do {
      var beg = pu*m + 1;
      var end = (pu+1)*m;
      for i in beg+1..end do {
         A[beg] += A[i];
      }
   }
   s.stop();
   writeln("Total seconds: ", s.elapsed());

I would recommend gpuClock for more nitty-gritty optimization scenarios where you are trying to profile a big kernel in order to find the slowest parts, for example.
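
For illustration, that kind of use might look roughly like the sketch below. The two "phases" here are just stand-in arithmetic I made up, not anything from your program:

use GPU;

config const numThreads = 3200;

on here.gpus[0] {
  var cyclesA, cyclesB: [0..#numThreads] uint;
  var Data: [0..#numThreads] real = 1.0;

  @assertOnGpu
  forall tid in 0..#numThreads {
    const t0 = gpuClock();
    // "phase A": some arithmetic standing in for the first part of a big kernel
    var acc = 0.0;
    for i in 1..100 do acc += Data[tid] / i;
    const t1 = gpuClock();
    // "phase B": a second, cheaper piece of work
    Data[tid] = acc * acc;
    const t2 = gpuClock();
    cyclesA[tid] = t1 - t0;
    cyclesB[tid] = t2 - t1;
  }

  // each thread recorded its own cycle counts; the host summarizes them afterwards
  writeln("phase A cycles (avg): ", (+ reduce cyclesA):real / numThreads);
  writeln("phase B cycles (avg): ", (+ reduce cyclesB):real / numThreads);
}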

Engin

Pull Request #27022 ("Add a warning when `gpuClock` is called from the host", by e-kayrakli on chapel-lang/chapel, GitHub) adds a warning for this case.

Engin

Hi Engin: many thanks for the suggestion. First, I will divide by npu to get a feel for how the gpu is performing. As for your last suggestion, I started thinking about the problem of timing the gpu almost exactly, if not exactly, as you suggested. When I do this, however, the reported elapsed time is considerably longer (on my computers) than that of a straightforward "+ reduce A" on the cpu, and this sounded strange to me. I am not clear -- and I apologize for my ignorance -- whether the calls to stopwatch run on the gpu or not; whether control somehow has to be sent back to the cpu for the time to be measured, adding extra elapsed time; etc.

When I get back home later I will try to run a few more tests and will report them, perhaps with a table summarizing what I got. Many thanks again,

Nelson

Hi Nelson,

That's curious. Seeing more results can help me understand the issue a bit better.

As a reflex, though, I can see the potential for a big difference between + reduce A and forall pu in 0..npu-1, where the former will use 1 GPU thread per element, compared to the latter where a total of npu threads will be used. Which one should give better performance is a bit hard to tell, but note that + reduce A should perform relatively well for a reduction. Another note is that you'd need at least thousands, if not tens of thousands, of threads on a GPU to make it run efficiently. I see that you are hard-wiring npu to 3200 -- why is that the case? GPU cores thrive under oversubscription, so if you are trying to use one thread per core, that's unlikely to get you the best hardware utilization. Was there a case where you were unhappy with + reduce A's performance that made you fall back to writing your own loop?
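
(Just to illustrate the oversubscription point: if you let the loop range over every element and use a reduce intent, the GPU gets one thread per element to schedule, which is roughly the granularity + reduce A works at. A sketch, reusing ng and A from your program:)

   var sum = 0.0;
   forall i in 1..ng with (+ reduce sum) do
      sum += A[i];
   writeln("sum = ", sum);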

Engin

(Meanwhile, I also created Issue #27026, "How should `gpuClock` behave when it is used outside of its scope?", on chapel-lang/chapel (GitHub) to discuss how gpuClock should behave under some edge cases.)

Hi again Engin. My experiments grew out of curiosity about how much faster I could get using my gpu over a 16-core cpu. For the time being I will show 3 results: 1) + reduce running on the cpu -- redsum.chpl; 2) manually dividing the sum over npu cores in the gpu (clocking with gpuClock) -- mansum-g.chpl; and 3) manually dividing the sum over npu cores in the gpu (clocking with stopwatch) -- mansum-gsw.chpl. The summary table is the following:

redsum:      sum = 8001547.9314  took 0.00205700 s
mansum-g:    sum = 8001547.9314  took 1.10298875 s
mansum-gsw:  sum = 8001547.9314  took 0.02228300 s

my gpu is an NVIDIA GeForce RTX 3060
my cpu (per inxi): 16-core (8-mt/8-st) 12th Gen Intel Core i9-12900K, 64-bit,
type: MST AMCP, smt: enabled, arch: Alder Lake, rev: 2,
cache: L1: 1.4 MiB, L2: 14 MiB, L3: 30 MiB

I figured I have ~ 3584 cores in the gpu. I used npu=3200 so that --Nn=16000000 divided by 3200 will sum 5000 elements on each core (hopefully :slight_smile: )

If one believes the time measurements to be correct, summing in the cpu is ~10 times faster than summing in the gpu (at best). It is hard for me to understand the difference in the results between mansum-g.chpl and mansum-gsw.chpl. Although this may be too much to report here, I am listing the three programs below. They all use a one-line defNn.chpl which defines Nn, and qdran.chpl which is a simple implementation of the quick-and-dirty random number generator from Numerical Recipes. These latter two files are not listed. Thanks again for the answers!

// ===================================================================
// ==> redsum: automatic distribution of the sum among available
// processing units with reduce
// ===================================================================
use qdran;          // quick and dirty random number generator from
                    // numerical recipes
use defNn;          // defines Nn
use Time;
// -------------------------------------------------------------------
// fill A with a simple random (serial) generator
// -------------------------------------------------------------------
var A: [1..Nn] real;
seeqd(0);
for i in 1..Nn do A[i] = ranqd();
var runtime: stopwatch;
runtime.start();
var sum = + reduce A;
runtime.stop();
writef("sum = %12.4dr  took %12.8dr s\n",sum, runtime.elapsed());
// ===================================================================
// ==> mansum-g: manual distribution of the sum among available
// gpu cores. Timing with gpuClock()
// ===================================================================
use qdran;          // quick and dirty random number generator from
                    // numerical recipes
use defNn;          // defines Nn
use GPU;
var A: [1..Nn] real;
seeqd(0);
for i in 1..Nn do A[i] = ranqd();
on here.gpus[0] {
   const ng = Nn;
   var ah = A;
   const npu = 3200;         // the number of GPU cores needs to be supplied
   assert( ng % npu == 0);   // make sure we can distribute the array 
   var m = ng/npu;           // # of elements to be summed by each core
   var psum: [0..npu-1] real = 0.0;
   var diff: [0..npu-1] uint;
   @assertOnGpu
   forall pu in 0..npu-1 do {
      var ist = gpuClock();
      var beg = pu*m + 1;
      var end = (pu+1)*m;
      for i in beg..end do {
         psum[pu] += ah[i];
      }
      var ien = gpuClock();
      diff[pu] = ien - ist;
   }
   var sum = 0.0;
   for pu in 0..npu-1 do {
      var ist = gpuClock();
      sum += psum[pu];
      var ien = gpuClock();
      diff[pu] += (ien - ist);
   }
   var avg = (+ reduce diff)/npu;
   writef("sum = %12.4dr  took %12.8dr s\n", sum,(avg:real(64))/(gpuClocksPerSec(0):real(64)));
}
// ===================================================================
// ==> mansum-gsw: manual distribution of the sum among available
// gpu cores. uses Time.stopwatch
// ===================================================================
use qdran;                    // quick and dirty random number generator
                              // from Numerical Recipes
use defNn;                    // defines Nn
use GPU;
use Time;
var A: [1..Nn] real;
seeqd(0);
for i in 1..Nn do A[i] = ranqd();
on here.gpus[0] {
   const ng = Nn;
   var ah = A;
   const npu = 3200;         // the number of GPU cores needs to be supplied
   assert( ng % npu == 0);   // make sure we can distribute the array 
   var m = ng/npu;           // # of elements to be summed by each core
   var psum: [0..npu-1] real = 0.0;
   var rt: stopwatch;
   rt.start();
   @assertOnGpu
   forall pu in 0..npu-1 do {
      var beg = pu*m + 1;
      var end = (pu+1)*m;
      for i in beg..end do {
         psum[pu] += ah[i];
      }
   }
   var sum = 0.0;
   for pu in 0..npu-1 do {
      sum += psum[pu];
   }
   rt.stop();
   writef("sum = %12.4dr  took %12.8dr s\n", sum,rt.elapsed());
}

Hi Nelson —

I believe Engin's out for the evening, but have you tried using a + reduce within the on here.gpus[0] clause directly? I believe that's what he was suggesting above, and that should map down to highly-tuned GPU implementations from Chapel 2.1 onwards, if I've got my facts straight. Unless I've missed something above, you're only comparing CPU + reduce with hand-coded GPU reductions?
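
In sketch form, I'm imagining something like this, reusing the setup from your mansum-gsw listing (so A is already filled on the host and Time is already used) -- untested, of course:

   on here.gpus[0] {
      var ah = A;               // copy the host data to GPU memory
      var rt: stopwatch;
      rt.start();
      var sum = + reduce ah;    // should map to the tuned GPU reduction
      rt.stop();
      writef("sum = %12.4dr  took %12.8dr s\n", sum, rt.elapsed());
   }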

-Brad

Hi Nelson,

Using just a reduction is probably not a great benchmark here. Reductions have two characteristics, neither of which is great for GPUs:

  • They have very little computation for each memory read: you literally do a memory read and a very simple arithmetic operation. Compared to CPUs, GPUs have much lower bandwidth per core. Therefore, they shine when you can bring in some data from memory and do as much number crunching with it as you can. Your attempt to create blocks is probably a way to give more work per core, but it also causes more memory reads per core.
  • Reductions inherently need synchronization: say you have produced a result for each thread -- how are those results going to be combined into the final scalar value? What happens in our reduction support (and also in CUB, which is used by Thrust) is a tree-based reduction where you gradually "mask out" GPU threads, so that more and more of them lie idle while only a few first reduce the results from their block (512 threads per block by default in Chapel). At the end of this step you are still not done -- all you have is a per-block result, which also has to be reduced across blocks. Unfortunately, inter-block data exchange on GPUs is not directly possible. In our support (and also in CUB), this results in subsequent kernel launches over smaller and smaller data, and in the end I wouldn't be surprised if the final couple of hundred numbers are reduced by the host, even. (See the small sketch after this list.)
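
To make the tree-based idea above concrete, here is a tiny CPU-side sketch of the shape of that algorithm; it is only an illustration of the pattern, not how the actual GPU kernels are written:

var vals: [0..#8] real = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
var stride = 1;
while stride < vals.size {
  // only the "active" lanes do work; the rest sit idle, like masked-out GPU threads
  forall i in 0..#vals.size by 2*stride {
    if i + stride < vals.size then
      vals[i] += vals[i + stride];
  }
  stride *= 2;
}
writeln(vals[0]);   // 36.0, the sum of 1 through 8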

Giving each thread a block of work makes sense in principle, although 16M reals is a very, very small amount of data. There's also the question of the order of memory reads: your current implementation makes each thread read global memory sequentially. Counter-intuitively for someone used to optimizing CPU performance, that's not the ideal way for all the threads to read the GPU's global memory. You want threads next to each other to read data that's next to each other in memory, resulting in strided access per GPU thread. Below is an experiment of my own:

use Time;
use GPU;
use ChplConfig;

enum reduceMode { expr, blocked, strided };

config type dataType = real;
config const useGpu = CHPL_LOCALE_MODEL=="gpu";
config const nElems = 100;
config const mode = reduceMode.expr;
config const elemsPerThread = 5000;

if mode!=reduceMode.expr then assert(nElems > elemsPerThread &&
                                     nElems % elemsPerThread == 0);

var t: stopwatch;

on if useGpu then here.gpus[0] else here {
  var Arr: [0..#nElems] dataType = 17;

  var sum: real;

  select mode {
    when reduceMode.expr {
      t.start();
      sum = + reduce Arr;
      t.stop();
    }
    when reduceMode.blocked {
      const numThreads = nElems/elemsPerThread;

      t.start();
      forall t in 0..#numThreads with (+ reduce sum) {
        for i in t*elemsPerThread..#elemsPerThread {
          sum += Arr[i];
        }
      }
      t.stop();
    }
    when reduceMode.strided {
      const numThreads = nElems/elemsPerThread;

      t.start();
      forall t in 0..#numThreads with (+ reduce sum) {
        var cur = t;
        // wanted to use a strided range, but for some reason it performed badly
        for 0..#elemsPerThread {
          sum += Arr[cur];
          cur += numThreads;
        }
      }
      t.stop();
    }
  }

  writeln(sum);
  assert(sum == nElems*17);
}


writeln("Time (s): ", t.elapsed());
writeln("Throughput (GiOP/s): ", nElems/t.elapsed()/(1<<30));

This compares + reduce Arr to blocked and strided approaches.
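
(For reference, the modes above are selected through Chapel's config constants on the command line; the binary name here is just a placeholder:)

./reduce-bench --nElems=16000000 --mode=expr
./reduce-bench --nElems=16000000 --mode=blocked --elemsPerThread=5000
./reduce-bench --nElems=16000000 --mode=strided --elemsPerThread=5000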

  • CPU: i5-11400 @2.6GHz, 12 cores
  • GPU: RTX A2000

On CPU, I get 2 GiOP/s, that is 2 billion summations per second.

On GPU:

  • reduce expression: .61 GiOP/s
  • blocked loop (5000 elems per thread): .63 GiOP/s
  • strided loop (5000 elems per thread): .68 GiOP/s

It looks like the CPU is hard to beat, for the reasons above. There may be a bit more you could squeeze out of your GPU beyond + reduce Arr, but it would probably be marginal.

You'd typically want to reduce on the GPU if your data is already on the GPU and is needed for a much more computationally intensive operation (or generated as a result of one). Or, similarly, if you have a single big kernel which performs a reduction alongside some other computationally intensive operation, e.g.

forall elem in Arr with (+ reduce sum) {
  sum += doALotOfScience(elem);
} 

I hope this helps,

Engin

Hi Brad and Engin: Regarding running + reduce in the gpu and timing it, the result (with Time.stopwatch) is in the last row below (I am repeating the former results to ease comparison)

redsum:      sum = 8001547.9314  took 0.00205700 s
mansum-g:    sum = 8001547.9314  took 1.10298875 s
mansum-gsw:  sum = 8001547.9314  took 0.02228300 s
redsum-g:    sum = 8001547.9314  took 0.00678400 s

This agrees in general with Engin's remarks: + reduce on the GPU is more than 3 times slower than + reduce on the CPU. I learned a lot from these exchanges, including (a) that I can run a stopwatch in the GPU, and (b) that GPUs are good at massively parallel operations such as A = B + gamma*C (A, B, C arrays, gamma a scalar), but not for other things. I will keep those lessons in mind as I write my applications, hopefully for the GPU as well :slight_smile: Many thanks for the enlightening comments.

As much as I like our design, I understand that what runs where is a bit confusing, and choosing the right words to describe things can be difficult. The overarching rule is: if you are executing on a GPU sublocale, e.g. with something like on here.gpus[0], only GPU-eligible things will execute on the GPU.

Here are some examples to help:

// you can't tell whether a function will execute on the GPU or CPU
// by just looking at its definition
proc foo() {
  var x: int;
  forall ...
}

// we just started executing on the CPU, nothing will magically end up on the GPU.

// foo will be invoked by the CPU
// the forall inside foo will also execute on the CPU even if
// it was GPU eligible.
foo(); 

on here.gpus[0] {
  // now we are executing on the GPU  _sublocale_
  // that doesn't necessarily mean everything inside this block will 
  // be executed by the GPU

  var x: int; // this is a scalar, it will _not_ be on the GPU
  var r: myRecord; // this is a record value, it will _not_ be on the GPU
  var t: 3*int; // ditto, records and tuples are similar. _not_ on the GPU
  var c: MyClass; // classes are different -- this will be allocated on the GPU
                  // but still will be accessible by the CPU with some magic
  var Arr: [1..n] int; // arrays will be allocated on the GPU
                       // the CPU can access it through communication under the hood
                       // it will not be great for performance if you use this on CPU

  for... // for loop is sequential -- executed on the CPU
  
  forall ... // if GPU-eligible, will turn into a kernel

  Arr = 3;  // this is a whole-array operation. Under the hood, this is actually a forall
            // this will execute on the GPU

  writeln(Arr);  // you can do this, it will execute on the CPU because it is IO.
                 // elements will be read one-by-one with communication (i.e. cudaMemcpy)

  foo();  // the actual function call will be executed by the CPU.
          // all the rules I outlined above now apply to the body of foo
          // e.g. "var x" within the body will be on CPU memory
          // e.g. the forall within the body will execute as a kernel
          // (the same loop can execute both on the CPU and GPU depending on the context)
}

Got it. Thanks.