Reindexing to same array

Hello! I tried to “cheat” the compiler into redefining an array’s domain with the code

var x : [0..4] real = {1..5};
writeln(x.domain); // prints {0..4}
x = x.reindex(1..5); // executes silently... does nothing?
writeln(x.domain); // prints {0..4}
for i in 1..5 do writeln(x[i]); // runtime error on x[5]

The compiler did not complain. Apparently the x = x.reindex(1..5) did nothing, and then
there was a runtime error when x[5] was reached, which is logical since the domain had
not changed. Is the “reindex doing nothing” the expected compiler behavior?

Thanks in advance, Nelson

Hi Nelson —

Good question. x = x.reindex(1..5); is doing something, just nothing interesting. Specifically, x.reindex(1..5) creates a new array that’s a view of x's elements that’s indexed using 1..5 rather than 0..4. E.g., if you wrote (x.reindex(1..5))[0] (i.e., you tried to access element 0 of this array), you’d get an out-of-bounds error. Similarly, (x.reindex(1..5))[5] would work.
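To make that concrete, here are those two accesses as a minimal sketch (reusing the x from your snippet):

writeln((x.reindex(1..5))[5]);    // fine: index 5 of the view refers to x[4]
// writeln((x.reindex(1..5))[0]); // would halt: 0 is not an index of the view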

Then, that view you’ve created is assigned to x which still has the original indices 0..4. This is legal because Chapel permits assignment between rectangular arrays with different indices if they have the same size and shape. As a clearer example, the following code is legal Chapel:

var A: [0..4] real;
var B: [1..5] real;
var C: [1..10 by 2] real;

A = B;  // copies B's five elements to A; doesn't change A's indices
A = C;  // copies C's five elements to A; doesn't change A's indices

So your assignment is effectively like the A=B case above, copying the elements (which were already x’s elements, so that’s not very interesting), but not copying/changing/modifying the indices.

A more typical way to use a reindexed expression is using a ref declaration, like:

ref x1 = x.reindex(1..5);

at which point your loop would work as expected (replacing x with x1):

for i in 1..5 do writeln(x1[i]); // works: prints x1[1] through x1[5]

Another common pattern is to pass the reindex expression into a function that takes an array argument, where the function will treat it like any other array.
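For example, here is a minimal sketch of that pattern (printFirst is just an illustrative helper made up for this example, not anything from your code):

proc printFirst(A: [] real) {
  writeln("first index: ", A.domain.first, "  value: ", A[A.domain.first]);
}

printFirst(x.reindex(1..5));  // inside the procedure, A is indexed 1..5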

One final note: reindex views of arrays incur overhead, both to set up the view and on every access (because the logical reindexing is computed each time an element is accessed). So think of this as a convenience feature rather than one you’d want to lean on heavily in performance-sensitive code.

The “real” way to redefine an array’s domain is to declare the domain as a variable:

var D = {1..5};
var A: [D] real = [i in 1..5] i;

and then to re-assign the domain:

D = {0..4};

however, it’s important to note that for a given index i in both the old and new domain value, A[i] is logically preserved across this assignment. I.e., this isn’t a memory-focused re-alloc command, but a logical change of the array’s indices, preserving any elements that are in both the old and new index set. Thus, in the code above, the resulting values of A are:

0 1 2 3 4

where A[1..4] preserved their original values, A[5] was dropped on the floor, and A[0] just happens to be 0 since that’s the default value for integers (i.e., changing the domain to -1..6 would similarly set elements A[-1] and A[6] to 0).
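For completeness, here is the above as one runnable sketch (same declarations as before; the writeln outputs are the values described):

var D = {1..5};
var A: [D] real = [i in 1..5] i;
writeln(A);        // 1.0 2.0 3.0 4.0 5.0
D = {0..4};        // change A's indices; elements at indices in both domains are preserved
writeln(A);        // 0.0 1.0 2.0 3.0 4.0
writeln(A.domain); // {0..4}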

Hope this is helpful,
-Brad

Hi Brad,

Many thanks for the answer; it is crystal clear, and I should have guessed it from the rule that assignment between arrays with the same shape is permitted! The question arose when I was testing a procedure for matrix-vector multiplication (serial, very old-fashioned Fortran-77-like) that would work for any square matrix A of shape (n,n) and any vector x of shape (n,) regardless of their domain indices.

After your reply, I tested 3 versions: v1 only allows domains {1..n, 1..n} and {1..n}; v2 uses ref rx = x.reindex(1..n), etc., locally; and v3 finds the first indices and incorporates them into the code, i.e., a[ia+i, ja+j]*x[jx+j], etc., where ia, ja, jx are obtained from calls to a.first, x.first (and y.first) (sketched below).

I timed them with n == 1000 and 100 or 1000 repetitions. With all default switches (meaning bounds checking turned on), v1 and v3 are about equal in speed, while v2 (which uses reindex) is 6.8% slower. Interestingly, when all 3 versions are compiled with --fast, v2 is slightly faster, so in this setting reindex does not seem to hurt performance... Note that this is overall runtime, and includes calls to fillRandom.

Thanks again,

Nelson
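A rough sketch of the v3 offset idea described above (matvec3 and the offset variable names are illustrative assumptions, not the actual sources):

proc matvec3(const ref a: [?Da] real, const ref x: [?Dx] real, ref y: [?Dy] real) {
  const (ia, ja) = Da.first;   // first row/column index of the matrix
  const jx = Dx.first;         // first index of the input vector
  const iy = Dy.first;         // first index of the result vector
  const n  = Dx.size;
  for i in 0..n-1 {
    var s = 0.0;
    for j in 0..n-1 do
      s += a[ia+i, ja+j] * x[jx+j];
    y[iy+i] = s;
  }
}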

Huh, that is interesting, and surprising to me. But I won't complain.

Assuming this is a classic "triply nested loop" style of mat-mult algorithm, my head goes to doing the index-neutral approach by querying the domains' dimensions using something like:

const (rows, cols) = Mat.domain.dims();
var Result: [rows] real;
forall i in rows do
  for (jm, jv) in zip(cols, Vect.domain.dim(0)) do
    Result[i] += Mat[i, jm] * Vect[jv];  // or we could play the standard trick of accumulating into a temp

Or, if it were reasonable to assume that the matrix and vector shared the same indices in the second dimension, you could simply assert that to avoid the use of the zippered inner loop and just re-use a single j index for both Mat and Vect.
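In code, that simpler variant might look roughly like this (a sketch assuming Mat's second dimension and Vect really do share the same indices, and using the accumulate-into-a-temp trick mentioned above):

assert(Mat.domain.dim(1) == Vect.domain.dim(0));  // shared indices in the second dimension
const (rows, cols) = Mat.domain.dims();
var Result: [rows] real;
forall i in rows {
  var sum = 0.0;               // accumulate into a temp, store once per row
  for j in cols do
    sum += Mat[i, j] * Vect[j];
  Result[i] = sum;
}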

Hi again Brad. Thanks for showing the totally index-agnostic version
of matrix multiplication :slight_smile: I included your code as version 4 (but
using for instead of forall), and timed the 4 versions with
A (1000x1000), x, y (1000), and 1000 repetitions, with --fast. Only the
matmul operations are being timed. The sources are in a Dropbox folder
that can be downloaded here:

Here is the result, in seconds, on a ~4-year-old Dell Vostro with
8 i7 cores and 8 GB of RAM:

v1 = all ranges are 1..n
v2 = reindex
v3 = sum index offsets
v4 = zip using dims(), dim(0)

    v1        v2        v3        v4
11.6832   11.6516   11.6582   11.6691

The optimization seems to be producing code that runs equally fast
regardless of the version.

Cheers

Nelson

Cool, thanks for sharing your code and results! Are all four versions amenable to having the outer loop made into a forall? (I haven’t had the chance to open them up to look myself). If so, you should try it to make better use of the other 7 of your 8 cores!

Yes, it is trivial to change the outer loop into a forall. It leads to a ~4x speed gain:

            v1          v2          v3          v4
for         11.6832     11.6516     11.6582     11.6691
forall       0.367176    0.372347    0.362295    0.376119

The gain is only ~4x because, for some reason, only 4 cores (0–3) are active. I can see the cores’ loads with a Linux Mint desklet called CPULoad. But it was really easy and simple to do :slight_smile:

… and putting export CHPL_RT_NUM_THREADS_PER_LOCALE=8
in .bashrc activates all 8 cores :slight_smile:

I do not know for certain that this is the case, but I suspect the reason it was using 4 cores is that Chapel defaults to not using hyperthreads. I suspect your system has 4 cores with 2 hyperthreads each. In fact, you can set CHPL_RT_NUM_THREADS_PER_LOCALE to MAX_LOGICAL or MAX_PHYSICAL to toggle between considering the hyperthreads or not, without having to specify the number itself (although using the number is fine if it works for you!). See also https://chapel-lang.org/docs/master/usingchapel/tasks.html#controlling-the-number-of-threads .
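For example, in your .bashrc, one of:

export CHPL_RT_NUM_THREADS_PER_LOCALE=MAX_PHYSICAL  # one thread per physical core (ignore hyperthreads)
export CHPL_RT_NUM_THREADS_PER_LOCALE=MAX_LOGICAL   # one thread per hardware thread (include hyperthreads)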

Yep, MAX_LOGICAL also activates the 8 “cores” :slight_smile: Thanks.

What performance improvement did you see when you went from 4 logical cores to 8 logical cores?

Our experience with floating-point-intensive tasks says that it is best to disable hyperthreading. An FPU does not like being shared between threads. Well, certainly when the underlying environment is that which is created by a big Fortran job running under MPI.

Certainly this is the case for our large finite element jobs, where one has (say) 24 physical cores and 48 logical cores with hyperthreading. Telling a program that it has 48 cores when it only has 24 actual floating point units to share between them all confuses the hell out of any scheduling, because 50% of the time the FPU is busy handling somebody else.

Mind you, if Chapel (and its underlying PGAS architecture) handles this scenario better and knows how to share FPUs optimally, that is one more good reason to use Chapel.

My 2c.

I have very little first-hand experience with that: I only made a few experiments with matrix multiplication, etc., on a desktop. But I have heard from people who give support to numerical weather prediction that they do turn off hyperthreading in their cluster as well, so my only contribution is ... hearsay. If I get more serious results, I will share.

Cheers

Nelson

> I have very little first-hand experience with that: I only made a few
> experiments with matrix multiplication, etc., on a desktop. But I have heard
> from people who give support to numerical weather prediction that they do
> turn off hyperthreading in their cluster as well,

Everybody I know who uses Fortran programs (with C supporting routines,
run under MPI) for finite elements does this.

> so my only contribution is ... hearsay. If I get more serious results, I
> will share.

Thanks - I would be curious how Chapel goes.

Regards - Damian
