Performance Considerations

I am just looking at the inner workings of a Merge Sort algorithm which
tries to merge

 t[lo .. mid by s] which is sorted

and
 t[mid + 1 .. hi by s] which is also sorted

back into itself, i.e. t[lo .. hi by s], where s is the stride of 't'.

Could I get feedback on any of my comments below beginning with IS, please?

 proc mergeTwoBits
 (
     t: [?tDom] ?tType,
     lo : ?I,
     mid : I,
     hi : I
 )
 {
     // duplicate 't' but with a stride guaranteed to be one making
     // its access during comparisons as cache friendly as possible

     // IS this declaration and implicit copy the absolute optimal
     // or should it be done with a 'forall' or done different ways
     // depending on whether the stride is one (1) or not one (!= 1)?

     const b : [lo..hi] tType = t;

     // some integers to drive things

     const s = tDom.stride;

     var i = lo, j = mid + 1, k = lo;

     // effectively split the copy as lower/upper chunks,
     // i.e. b[lo .. mid] and b[mid+1 .. hi] respectively
     //
     // merge this split copy back into the original until
     // one of those chunks is exhausted of elements

     while i <= mid && j <= hi do
     {
         var _k : k.type;

         if b[i] <= b[j] then
         {
             _k = i; i += 1;
         }
         else
         {
             _k = j; j += 1;
         }

         t[k] = b[_k];
         k += s;
     }

     // copy any of the lower piece that is left

     // IS if i <= mid then t[k .. hi by s] = b[i .. mid] any better/faster

     // NOTE THAT if i <= mid then t[k..hi::s] = b[i..mid] looks cleaner

     while i <= mid do
     {
         t[k] = b[i];
         i += 1;
         k += s;
     }

     // copy any of the upper piece that is left

     // IS if j <= hi then t[k .. hi by s] = b[j .. hi] any better/faster

     // NOTE THAT if j <= hi then t[k..hi::s] = b[j..hi] looks cleaner

     while j <= hi do
     {
         t[k] = b[j];
         j += 1;
         k += s;
     }
 }

 proc main
 {
     var x : [1..100] int;

     // a whole lot of missing stuff goes in here

     mergeTwoBits(x, 1, 50, 100);
 }

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 … Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

There are bugs in what I posted, but the comments/questions still hold.

Regards - Damian


Hi Damian —

For a dense array t, this should be about as optimal as it can be written in Chapel. Or at least, I’m not aware of a better way to write it, and if this didn’t give us our top performance, we’d want to know about that. Maybe put another way, I’d never suggest someone use a forall here instead.

The thing that gives me pause about this is that you mention a potential non-unit stride, but if t is strided, b seems like it is going to have more elements than t. E.g., if lo is 1, hi is 10, and t is over {1..10 by 2}, t will have 5 elements to b's 10. For that reason, I'd just write it as:

 const b = t;

Alternatively, you could write it as:

 const b : [t.domain] tType = t;

Or, if the goal is to make b into a non-strided array, I think you'd want to do something more like:

 const b : [1..t.size] tType = t;

(but then you'd also need to change your lo, mid, hi values into 1-based values?)

Like a lot of performance questions, I think the answer here, unfortunately, is “it depends.” Beyond the elegance benefits of your slice-based version, it also has the potential to use parallelism and/or bulk copies (e.g., memcpy()s) to do the copies. But it has the downside of creating array slices, which have some overhead in our current implementation.

So if the number of elements is small, your while loop is probably best; whereas if the number is large, the slices could have benefits if we're doing our job right.

This “parallel copying of arrays” is an area that @e-kayrakli has been working in recently, so he may have additional thoughts. Related, we’ve discussed what it would take to reduce the overhead of slicing for cases like this that are used once for an assignment and then dropped on the floor—in contrast to cases where the slice is passed to a routine expecting an array and has to act like an array for an extended period of time as a result.

-Brad

This “parallel copying of arrays” is an area that @e-kayrakli has been working
in recently, so he may have additional thoughts…

If you are copying from a non-strided slice to a strided one, then that's a
complex transfer which we don't parallelize (I am sure we don't in module code;
my guess is that we don't in runtime code either). If the LHS in your
assignment wasn't strided, then it would be parallelized, provided the
destination and the source data are on the same locale and the amount you
transfer is more than 2MB. In that scenario, you can tune this threshold
by setting -sparallelAssignThreshold, or enable parallel transfers where the
source and the destination are on different locales by using
-senableParallelPutsInAssignment and/or -senableParallelGetsInAssignment.
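(For concreteness, a sketch of my own, not from Engin's message: these -s settings are compile-time config-param overrides, so they are passed to chpl when compiling rather than to the generated binary. The file name merge.chpl and the threshold value below are placeholders.)

 chpl --fast -sparallelAssignThreshold=<threshold> merge.chpl
 chpl --fast -senableParallelPutsInAssignment=true merge.chpl
 chpl --fast -senableParallelGetsInAssignment=true merge.chpl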

We have seen that parallel comm in array assignment helps in many cases but it
hurts in some, and that seemed to be system-dependent, which is not too
surprising. So, your mileage may vary.

Brad,

I will merge my reply to you with my reply to Engin. But not today.
Over the weekend.

I also notice that I need to put this email in a bit more context.

Catch you - Damian

P.S. I use my tweaks to the compiler so I can use a slice

21..41::2

It looks much cleaner than using ‘by’ with spaces either side.


Catch you - Damian

P.S. I use my tweaks to the compiler so I can use a slice

21…41::2

Why did Discourse strip the tab from my email here?

It looks much cleaner than using ?by?with spaces either side.

Why did discourse change a single quote to a question mark and strip the
space before the word with?

Is it doing naughty things with character sets or Unicode?

Regards - Damian


Damian —

Your first message looks good as rendered on the Discourse site:

Or in my (text-based) inbox:

Or in my Outlook inbox:

Discourse interprets text as markdown, which is why the tabbed code got rendered differently; and it seems that something put smart quotes around your 'by' (which may be causing trouble for your email program, given that they're Unicode?), though I couldn't say what did it if it wasn't you.

-Brad

I put ordinary single quotes around 'by'.

Programs should not screw with input unless requested.

There needs to be a way to disable markdown. Actually, markdown should not
be the default. But if you want markdown, you should be able to easily ask
for it.

Forcing interpretation of text as markdown really makes Discourse less
than usable as a mailing list.

Regards - Damian


Some of my code was less than correct.

But your email and Engin’s answered my questions.

Consider, for example, that one wishes to sort every second element of an
array of elements, say

t : [1..100] int(64);

You might call the routine using

sort(t[1..99 by 2])

where you then have

 proc sort(t : [?tD] ?T)

If using TimSort, then one of the operations to be done is a merge.
This requires a temporary array which is used for all the
comparisons. This copy could be done as

const b = t; // ....(A)

However, if 't' is strided, then 'b' will also be strided, making it
subject to cache misses. That copy is also slow, but Engin's email
explains why.

Performing the copy as done by (A) will result in 'b' having lots of
elements, most of which never enter into any operations. If, say,

tD.stride == 2

then ‘b’ will be twice as big as needed.

Hence ‘b’, should really be forced to have a unit stride to minimize these
cache misses. So, instead we can define an array

const b : [1..tD.size] T = t; // ....(B)

Does that not make more sense than (A)? Also, Engin's email says that this
will be parallelized.
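As an illustration of the index bookkeeping that (B) implies (my own sketch, not from the thread; densePos is a hypothetical helper name): lo, mid and hi live in t's strided coordinates, so they would need remapping into b's 1-based dense coordinates before the merge loops run:

 // hypothetical helper: map an index i of the strided domain tD
 // onto the matching 1-based position within the dense copy 'b'
 inline proc densePos(tD, i) {
   return 1 + (i - tD.low) / tD.stride;
 }
 // e.g. with tD = {1..99 by 2}: densePos(tD, 1) gives 1, densePos(tD, 51) gives 26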

I noticed that changing that copy of (B) to a ‘for’ or ‘forall’ loop would
often result in faster performance. But I did not investigate this in any
detail. Engin’s email points me to ways I can look into that problem further.

Regards - Damian


The replacement of single quotes with "smart" quotes was an option
independent of the markdown choice, one which I hadn't noticed before today.
I've turned that feature off now that I've discovered it.

-Brad

Thanks heaps - Damian


Hi Damian —

There's a misconception here that I want to address. A Chapel array declared like var A: [1..10 by 2] real; stores its five elements densely in memory, not with "blank" elements in between them. So the cache behavior should be no different from (or worse than) using var B: [1..5] real;. The following program demonstrates this by printing out the addresses of the logically adjacent elements T[1] and T[3] of a copy of such a strided slice:

use CPtr;

var A : [1..10] real;

proc foo(X) {
  extern proc printf(x...);
  var T = X;  // local copy of the incoming (strided) slice
  writeln(T.domain);
  printf("%p %p\n", c_ptrTo(T[1]), c_ptrTo(T[3]));
}

foo(A[1..10 by 2]);

For that reason, the behavior between your (A) and (B) expressions is not so different, in that either gathers from strided memory to dense memory. All else being equal, I'd probably focus on (B), because there tend to be overheads associated with operating on strided arrays that aren't present for arrays with unit stride. But I'm less familiar with when we do / don't parallelize array assignments, so can't speak to that without experimenting.

-Brad