Performance Considerations

I am just looking at the inner workings of a Merge Sort algorithm which
tries to merge

 t[lo .. mid by s] which is sorted

and
t[mid + 1 … hi by s] which is also sorted

back into itself, i.e. t[lo…hi by s] where s is the stride of ‘t’.

Could I get comments on my any comments below beginning with IS please?

 proc mergeTwoBits
 (
     t: [?tDom] ?tType,
     lo : ?I,
     mid : I,
     hi : I
 )
 {
     // duplicate 't' but with a stride guaranteed to be one making
     // its access during comparisons as cache friendly as possible

     // IS this declaration and implicit copy the absolute optimal
     // or should it be done with a 'forall' or done different ways
     // depending on whether the stride is one (1) or not one (!= 1)?

     const b : [lo..hi] tType = t;

// some integers to drive things

     const s = tDom.stride;

     var i = lo, j = mid + 1, k = lo;

     // effectively split the copy as lower/upper chunks,
     // i.e. b[lo .. mid] and b[mid+1 .. hi] respectively
     //
     // merge this split copy back into the original until
     // one of those chunks is exhausted of elements

     while i <= mid && j <= hi do
     {
         var _k : k.type;

         if b[i] <= b[j] then
         {
             _k = i; i += 1;
         }
         else
         {
             _k = j; j += 1;
         }

         t[k] = b[_k];
         k += s;
     }

     // copy any of the lower piece that is left

     // IS if i <= mid then t[k .. hi by s] = b[i .. mid] any better/faster

     // NOTE THAT if i <= mid then t[k..hi::s] = b[i..mid] looks cleaner

     while i <= mid do
     {
         t[k] = b[i];
         i += 1;
         k += s;
     }

     // copy any of the upper piece that is left

     // IS if j <= hi then t[k .. hi by s] = b[j .. hi] any better/faster

     // NOTE THAT if i <= mid then t[k..hi::s] = b[j..hi] looks cleaner

     while j <= hi do
     {
         t[k] = b[j];
         j += 1;
         k += s;
     }
 }

 proc main
 {
     var x : [1..100] int;

     // a whole lot of missing stuff goes in here

     mergeTwoBits(x, 1, 50, 100);
 }

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 … Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

There are bugs in what I posted, but the comments/questios still hold.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 … Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Hi Damian —

For a dense array t, this should be about as optimal as it can be written in Chapel. Or at least, I’m not aware of a better way to write it, and if this didn’t give us our top performance, we’d want to know about that. Maybe put another way, I’d never suggest someone use a forall here instead.

The thing that gives me pause about this is that you mention a potential non-unit stride, but if t is strided, b seems like it is going to have more elements than t. E.g., if lo is 1, hi is 10, and t is over {1..10 by 2}, t will have 5 elements to b's 10. For that reason, I’d just write it as const b = t; Alternatively, you could write it as const b: [t.domain] tType = t; Or, if the goal is to make b into a non-strided array, I think you’d want to do something more like const b: [1..t.size] tType = t; (but then you’d also need to change your low, mid, high values into 1-based values?)

Like a lot of performance questions, I think the answer here, unfortunately, is “it depends.” Beyond the elegance benefits of your slice-based version, it also has the potential to use parallelism and/or bulk copies (e.g., memcpy()s) to do the copies. But it has the downside of creating array slices, which have some overhead in our current implementation.

So if the number of elements is small, your while loop is probably best; whereas if the number was large, the slices could have benefits if we’re doing our job right.

This “parallel copying of arrays” is an area that @e-kayrakli has been working in recently, so he may have additional thoughts. Related, we’ve discussed what it would take to reduce the overhead of slicing for cases like this that are used once for an assignment and then dropped on the floor—in contrast to cases where the slice is passed to a routine expecting an array and has to act like an array for an extended period of time as a result.

-Brad

This “parallel copying of arrays” is an area that @e-kayrakli has been working
in recently, so he may have additional thoughts…

If you are copying from a non-strided slice to a strided one, then that’s a
complex transfer which we don’t parallelize (I am sure we don’t at module code,
my guess is that we don’t do that in runtime code either). If the LHS in
your assignment wasn’t strided, than that would be parallelized if the
destination and the source data are on the same locale and the amount that you
transfer is more than 2MB. In that scenario, you can play with this threshold
by setting -sparallelAssignThreshold or enable parallel transfer where the
source and the destination are on different locales by using
-senableParallelPutsInAssignment and/or -senableParallelGetsInAssignment.

We have seen that parallel comm in array assignment helps in many cases but it
hurts in some, and that seemed to be system-dependent, which is not too
surprising. So, your mileage may vary.

Brad,

I will merge my reply to yourself with that from Engin. But not today.
Over the weekend.

I also notice that I need to put this email in a bit more context.

Catch you - Damian

P.S. I use my tweaks to the compiler so I can use a slice

21..41::2

It looks much cleaner than using ‘by’ with spaces either side.

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 … Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Catch you - Damian

P.S. I use my tweaks to the compiler so I can use a slice

21…41::2

Why did discourse strip the tab from my email here.

It looks much cleaner than using ?by?with spaces either side.

Why did discourse change a single quote to a question mark and strip the
space before the word with?

Is it doing naughty things with character sets or Unicode?

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 … Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Damian —

Your first message looks good as rendered on the Discourse site:

Or in my (text-based) inbox:

Or in my Outlook inbox:

Discourse interprets text as markdown, which is why the tab’d code got rendered differently; and it seems that something put smart quotes around your ‘by’, which may be causing troubles for your email program due to being unicode?) though I couldn’t say what if it wasn’t you.

-Brad

I put ordinary single quotes around

by

Programs should not screw with input unless requested.

There needs to be a way to disable markdown. Actually, markdown should not
be the default. But if you want markdown, you should be able to easily ask
for it.

Forcing interpretation of text to be markdown really makes discourse less
than usable as a mailing list.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 … Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Some of my code was less than correct.

But your email and Engin’s answered my questions.

Consider for example that one wishes to sort every second element of an an
array of elements, say

t : [1..100] int(64);

You might call the routine using

sort(t[1..99 by 2])

where you then have

 sort(t : [?tD] ?T)

If using TimSort, then one of the operations to be done is a merger.
This requires the use of a temporary array which is used for all the
comparisons. This copy could be done as

const b = t; // ....(A)

However, if ‘t’ was strided, then ‘b’ will also be strided, making it
subject to cache misses. Also, that copy is slow but Engin’s email
explains why.

Performing the copy as done by (A) will result in an array with lots of
elements in ‘b’, most of which never enter into any operations. If say

tD.stride == 2

then ‘b’ will be twice as big as needed.

Hence ‘b’, should really be forced to have a unit stride to minimize these
cache misses. So, instead we can define an array

const b : [1..tD.size] = t; // ....(B)

Does that not make more sense thn (A)? Also, Engin’s email says that this
will be parallelized.

I noticed that changing that copy of (B) to a ‘for’ or ‘forall’ loop would
often result in faster performance. But I did not investigate this in any
detail. Engin’s email points me to ways I can look into that problem further.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 … Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

The replacement of single quote with "smart" quotes was an option
independent of the markdown choice which I hadn't noticed before today.
I've turned that feature off now that I've discovered it.

-Brad

Thanks heaps - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Hi Damian —

There's a misconception here that I want to address. A Chapel array declared like var A: [1..10 by 2] real; stores its five elements densely in memory, not with "blank" elements in between them. So the cache behavior should be no different / worse than using var B: [1..5] real;. The following program demonstrates this by printing out adjacent addresses for A[1] and A[3]:

use CPtr;

config var n = 10;
var A: [1..10] real;

proc foo(X) {
  extern proc printf(x...);
  var T = X;
  writeln(T.domain);
  printf("%p %p\n", c_ptrTo(T[1]), c_ptrTo(T[3]));
}

foo(A[1..10 by 2]);

For that reason, the behavior between your (A) and (B) expressions is not so different, in that either is gathering from strided memory to dense memory. All else being equal, I'd probably focus on (B) because there tends to be overheads associated with operating on strided arrays that isn't present for arrays with unit stride. But I'm less familiar with when we do / don't parallelize array assignments, so can't speak to that without experimenting.

-Brad

I am lost. I will get back to you. It looks like slicing is making a copy rather than creating a ref.

Damian -

It might or might not help, but I'd like to point out a common stumbling block with slices and copies. When you slice, if you store it into a var or const variable it will copy the elements. If you pass it to a function with default intent or store it into a ref, it won't. The below shows my understanding of the situation:

var A:[1..10] int;

ref Slice = A[1..3]; // no copy
const ref ConstSlice = A[1..3]; // no copy
proc foo(X) { }
foo(A[1..3]); // no copy
 
var CopyOfSlice = A[1..3]; // copy when storing into the `var`
const ConstCopyOfSlice = A[1..3]; // copy when storing into the `const`

In the sort code in the package module Sort, we sometimes avoid slicing and instead pass start/end indices to the inner routines.

Sorry, I was unclear.

When my routine accepts a slice some of original data, the argument is given a default intent so I still keep the original layout.

My main worry was that in doing something like

var T = A;

it might try and do the copies with vector operations.

That it does this in parallel is good and allays my fears.

I did not realize was that the domani of the copy T (in the @bradc example) was not only the same as that of A but that it compacted the way it stored the data so that there were not the empty slots in memory (and hence everything was cache friendly).

All clear now. I had never gleaned that from the Chapel documentation but maybe I mussed it.

In my own work, I will need to make the copy like

var T : [1..A.domain.size] A.eltype = A;

so my existing algorithm still works as it assumes that the copy has a stride and first subscript that are both 1.

P.S. I hate the concept of extracting the base type of A as A.eltype. While the individual components of an array are called the elements of an array, the type of those elements has been called the base type of an array for over 50 years (which is longer than I have used a computer). Is there any chance of introducing a duplicate method on an array called baseType that matches eltType at the same time as we consider a new name for the stridable method or attribute of an array? Too trivial to raise an issue at this time.

In var A: [1..n] [1..3] real; would you imagine A.baseType to give real or [1..3] real;?

 	real

My understanding of the words

 	base type of an array A

is that it is the type of the elements of A.

Thanks - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

[edit: Striking this response because I read past the real too quickly and interpreted "the type of the elements of A" as meaning what eltType is today, which is to say [1..3] real]

That's too bad. Had it been the other definition, we could've supported both without conflict. I'm personally disinclined to do this name change, both due to it being a breaking change late in the day, and because to me "base" suggests something more primitive/basic (like the ultimate element type) than "element type" does. I.e., sometimes names that persist for 50 years aren't necessarily the best names. We've intentionally avoided some other longstanding names on similar grounds. I think the thing that would convince me otherwise would be if we found that every language we care about appealing to used the term baseType (I haven't done that survey).

[edit: I'm not striking out the following because even though it doesn't implement what you're wanting, it's still pretty reasonable information]

Currently, there's no official user-facing way to create a method on an array, but this is something we should work on supporting officially, in which case a user group could create their own names for whatever methods they wanted. In the meantime, I believe the following (subject to change in the future) code should work:

proc _array.baseType type return eltType;

var A: [1..10] real;
var B: [1..10] [1..3] real;

writeln(A.baseType:string);
writeln(B.baseType:string);

TIO

and of course if it were written as a standalone function, like:

proc baseType(X: []) type return eltType;

then it would be future-proof.

-Brad

Oops, @vass pointed out that I read past the real part of your answer too quickly and fixated on the "the type of the elements of A" part, which to me said "so it's the same as eltType." But A.eltType returns [1..3] real for the example above, hence my previous response. I've edited it on Discourse to strikethrough the parts that are outdated now.

So now that Vass has pointed out my error, that suggests that we could support both queries without duplicating a feature. For example, I'm imagining something like:

proc _array.baseType type {
  if isArrayType(eltType) then
    return baseTypeHelper(eltType);
 else
   return eltType;
}

proc baseTypeHelper(type t) type {
  if isArrayType(t) then
    return baseTypeHelper(t.eltType);
  else
    return t;
}

I don't believe that this will work today, but I don't think it would take much effort to get it working. I think this would be worth filing a feature request for, avoiding phrases like "the element type of the array" or "the base type of the array" to avoid confusion and demonstrating what it would mean for some examples like the ones here.

-Brad

Brad,

proc _array.baseType type {
if isArrayType(eltType) then
return baseTypeHelper(eltType);
else
return eltType;
}

proc baseTypeHelper(type t) type {
if isArrayType(t) then
return baseTypeHelper(t.eltType);
else
return t;
}

I don't believe that this will work today,

You are correct.

but I don't think it would take much effort to get it working. I think
this would be worth filing a feature request for, avoiding phrases like
"the element type of the array" or "the base type of the array" to avoid
confusion and demonstrating what it would mean for some examples like
the ones here.

I will submit the request on the weekend, complete with examples. Getting
the English semantically correct will be a challenge. It is not urgent.

And, you have confirmed a solution for both a 1D and a 2D conventional
array so I am very happy for now.

Thanks heaps - Damian