Performance of assignments involving slices

Consider the following assignment:

localBuffer[hash, localOffsets[hash], ..] = arrayChunk[.., i];

localBuffer is an array of shape (numLocales, N, M) which lives on the current locale. arrayChunk also lives on the current locale and has shape (M, N). And localOffsets is local as well, with numLocales elements.
Here, N is rather big (think 10^4 to 10^6) and M is rather small (probably smaller than 10).
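
For concreteness, the setup looks roughly like this (a sketch with placeholder sizes in the ranges just described):

config const N = 100_000;  // large dimension
config const M = 8;        // small dimension

// all three arrays are local (non-distributed)
var localBuffer : [0 ..# numLocales, 0 ..# N, 0 ..# M] real;
var arrayChunk : [0 ..# M, 0 ..# N] real;
var localOffsets : [LocaleSpace] int;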

This assignment was being called quite often and ended up eating a big chunk of my execution time. Rewriting it into a manual loop like this one:

for k in 0 .. numberVectors - 1 {
  localBuffer[hash, localOffsets[hash], k] = arrayChunk[k, i];
}

reduced the execution time by a factor of ~300!

I'm not posting this as a rant, but rather hoping that someone could explain what additional work the assignment of slices does that makes it so expensive. I imagined that since localBuffer[hash, localOffsets[hash], ..] was a contiguous chunk while arrayChunk[.., i] was not, Chapel would implement the assignment as a loop under the hood. Is this not the case? And if so, why?
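
To spell out the contiguity claim: Chapel arrays are stored in row-major order, so the last dimension varies fastest. A tiny sketch with a made-up 2-D array B of the same shape as arrayChunk:

config const M = 8, N = 100_000;
var B : [0 ..# M, 0 ..# N] real;
// row-major layout: element B[k, i] lives at linear offset k*N + i
// B[k, ..] -> N consecutive elements (contiguous)
// B[.., i] -> M elements, each N apart in memory (strided)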

Thanks!
Tom

Hi Tom —

That is interesting. I can't explain the difference without investigating more, which will require a reproducer or more information. I think you are saying that both arrays are completely local / non-distributed arrays (as opposed to arrayChunk being distributed and the slice of it you're referring to being local), is that right? Are you running on one locale or multiple? And if multiple, oversubscribed or on a cluster?

One obvious difference between the two expressions is that even if the compiler were to use a loop for the original code, I would expect it to use a forall loop. If you change your for loop into a forall, do you see additional benefit or slowdown?

I know we've talked about it, so I assume that both the before and after timings were gathered with --fast?

Thanks,
-Brad

Is it because, today, slice expressions run the allocator to create a slice object? (And I think that, for distributed arrays, they might do work on every locale in the process, but I am less sure about this.)

I suppose it could be, but I'd been imagining that the large problem size would amortize away the O(1) memory allocator costs. That said, I jumped to "large problem size" from the English text and didn't actually look at the values until now, and I'm seeing that they aren't terribly large. So that's definitely a possibility.

Tom, one way to test this hypothesis would be to separate the creation of the slices from the assignment and time the respective parts, e.g.:

ref locBuff = localBuffer[hash, localOffsets[hash], ..];
ref arrChunk = arrayChunk[.., i];
locBuff = arrChunk;

where timing the first two statements would give the cost of creating the slices (including the allocator calls Michael mentions), while timing the third would give the time for the assignment itself.
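
Concretely, that instrumentation might look like the following sketch (the declarations are hypothetical stand-ins for the arrays described at the top of the thread; the timer names are made up):

use Time;

// stand-ins for the arrays in the original post
var localBuffer : [0 ..# numLocales, 0 ..# 1000, 0 ..# 8] real;
var arrayChunk : [0 ..# 8, 0 ..# 1000] real;
var localOffsets : [LocaleSpace] int;
const hash = 0, i = 0;

var viewTimer, assignTimer : Timer;

viewTimer.start();
ref locBuff = localBuffer[hash, localOffsets[hash], ..];  // create the views
const ref arrChunk = arrayChunk[.., i];
viewTimer.stop();

assignTimer.start();
locBuff = arrChunk;  // the assignment itself
assignTimer.stop();

writeln("view creation: ", viewTimer.elapsed(),
        "  assignment: ", assignTimer.elapsed());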

-Brad

So I've done a few more timings.

  • Manual loop:
    for k in 0 .. numVectors - 1 {
      buffer[hash, offsets[hash], k] = array[k, i];
    }
    
    => 0.876606 0.888626 0.850986 0.829717 (time per locale)
  • forall loop:
    forall k in 0 .. numVectors - 1 {
      buffer[hash, offsets[hash], k] = array[k, i];
    }
    
    => 36.7382 29.0505 25.9959 26.0736
  • copying slices:
    buffer[hash, offsets[hash], ..] = array[.., i];   
    
    => too long :frowning:
    On an input which is 64 times smaller I get: 18.815 0.0 0.0 0.0. There's no work left for other locales because the input is so small, hence the zeros in timings.
  • copying references:
    ref dest = buffer[hash, offsets[hash], ..];
    const ref src = array[.., i];
    timer.start();
    dest = src;
    timer.stop();
    
    => too long :frowning:
    Again, on a 64-times smaller input: 8.57173 0.0 0.0 0.0

To answer your other questions, yes, I'm running on my laptop with CHPL_RT_OVERSUBSCRIBED=yes, CHPL_RT_NUM_THREADS_PER_LOCALE=1, and 4 locales (I have 4 physical cores). And yes, I'm compiling with --fast and CHPL_TARGET_CPU=native.

Also, in all the cases above M was actually 1, so the loop is only executed once.

And a quick test with N=327680 and M=8 shows that the manual loop is still about 100 times faster than copying references.

Hi Tom —

Here's a question I'm still curious about, from my earlier reply: "I think you are saying that both arrays are completely local / non-distributed arrays (as opposed to arrayChunk being distributed and the slice of it you're referring to being local), is that right?"

The fact that you didn't answer might suggest I was correct? But then I'm not clear on what would cause other locales to ever be involved either.

The fact that switching from a for to a forall hurts your timings so much seems a little suspicious to me; my first thought is that it's related to the oversubscription. When time permits, I'm going to see if I can create a reproducer (unless you have one handy that I can use) and see what happens in a non-oversubscribed setting.

-Brad

Not a direct answer to the performance difference here, but I'll just note that these are rank-change views, not slice views. It wouldn't surprise me if rank-changes are less optimized compared to slices both in terms of creation and bulk communication.
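
For anyone following along, here's a minimal sketch of the distinction (array A is made up). The original assignment uses integer indices on both sides, so both of its views are rank-changes:

var A : [0 ..# 4, 0 ..# 4] real;

ref sliceView = A[1..1, ..];  // slice: every index is a range, so the view
                              // keeps the original rank (still 2-D)
ref rankChange = A[1, ..];    // rank-change: the integer index collapses a
                              // dimension, yielding a lower-rank 1-D view

writeln(sliceView.rank);      // 2
writeln(rankChange.rank);     // 1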

@bradcray sorry about that, I totally forgot to answer it! All arrays are local, i.e., in the code snippets I'm showing there is no communication involved. However, I have multiple locales operating on different buffers, and that's where the per-locale times come from.

I'm afraid I don't have a small piece of code to share that's readily available. I did just try to run with one locale and CHPL_RT_NUM_THREADS_PER_LOCALE=4 so that there is one qthread running per physical core, and the forall loop still has a huge effect on performance.

@bradcray I have an example for you to play with:

use Time;
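// (Timer here is from the Time module of this Chapel version; newer versions rename it to stopwatch)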
use Random;

config const N = 100000;
config const M = 8;
config const iterations = 100;
config param algorithm = 0;

proc distributeBasedOnHash(const ref array : [?D] ?eltType,
                            const ref hashes : [] int,
                            ref timer : Timer) where (D.rank == 2) {
  const numVectors = D.dim(0).size;
  const size = D.dim(1).size;
  var buffer : [0 .. numLocales - 1, 0 .. size - 1, 0 .. numVectors - 1] eltType = noinit;
  var offsets : [LocaleSpace] int;
  for i in 0 .. size - 1 {
    const hash = hashes[i];
    timer.start();
    if (algorithm == 0) {
      for k in 0 .. numVectors - 1 {
        buffer[hash, offsets[hash], k] = array[k, i];
      }
    }
    else if (algorithm == 1) {
      forall k in 0 .. numVectors - 1 {
        buffer[hash, offsets[hash], k] = array[k, i];
      }
    }
    else {
      assert(false);
    }
    timer.stop();
    offsets[hash] += 1;
  }
  return (buffer, offsets);
}

proc main() {
  var array : [0 ..# M, 0 ..# N] real;
  var hashes : [0 ..# N] int;
  fillRandom(hashes);
  hashes = mod(hashes, numLocales);  // mod (not %) so negative random values land in 0 ..# numLocales

  var timer = new Timer();
  var (buffer, offsets) = distributeBasedOnHash(array, hashes, timer);
  writeln("Took ", timer.elapsed());
}

then on my laptop I get:

CHPL_TARGET_CPU=native chpl --fast -s algorithm=0 -o copying copying.chpl
CHPL_RT_NUM_THREADS_PER_LOCALE=4 ./copying -nl 1
# ==> 0.014
CHPL_TARGET_CPU=native chpl --fast -s algorithm=1 -o copying copying.chpl
CHPL_RT_NUM_THREADS_PER_LOCALE=4 ./copying -nl 1
# ==> 0.16

Oh, that's a really good point Engin, I'd completely overlooked that. I believe indexing into rank-change views is currently significantly more expensive. So a thing to try, as another point of comparison, might be:

const off = localOffsets[hash];
localBuffer[hash..hash, off..off, ..] = arrayChunk[.., i..i];

-Brad

Hm... it does not compile: rank mismatch in array assignment

Oh shoot, you're right, I'm multitasking too much.
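
For completeness, one way to make the ranks match while keeping a slice on the left-hand side would be to reshape the right-hand side. A hedged sketch (untested in this thread, and reshape may introduce a copy of its own, so it's only interesting as an experiment):

const off = localOffsets[hash];
// arrayChunk[.., i..i] is a 2-D (M x 1) slice; reshape it to the
// (1 x 1 x M) shape of the 3-D slice on the left so the ranks agree
localBuffer[hash..hash, off..off, ..] =
  reshape(arrayChunk[.., i..i], {0..0, 0..0, 0 ..# M});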