Partial reductions (operate on 2D-array and produce 1-D array)

e.g.
return [j in columns] (+ reduce [i in rows] (x[i] * z[i, …])) / h;

I am curious if this will be working any time soon. I cannot see anything
like it mentioned in any recent changes in ‘master’

I am just trying to figure out my schedule for the next few months, and
(specifically) wrapping up my presentation at CHIUW 2021 (if it is indeed
accepted). It is not a biggie if it is delayed but it would be nice if it
was working by the end of this year.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 … Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Hi Damian —

[sorry for the delayed response, I never did get on top of the weekend’s backlog]

Partial reductions, as a first-class concept (which is what I thought we’d talked about most recently), haven’t made it to the top of the priorities list yet, unfortunately, with interfaces being the latest thing to get in their way.

The pattern you’ve written here computes something partial reduction-like in a more explicit way than we’d ultimately like, from both an elegance and performance perspective; but when I talk about partial reductions not receiving attention in the previous paragraph, I’m thinking of the language-level support, not this more explicit style. If this form is acceptable to you (for the time being at least), we should be able to wrangle a version of this that works (I definitely have patterns like this working, albeit slightly different from yours).

I say “partial reduction-like” because given your loops over both i and j, I would’ve expected j to show up in the z indexing expression as well. E.g., something like z[i,j] * x[j] (though I may be focused on a case that’s the transpose of yours).

Let me know if I’m right that something’s amiss here, and once we get that straightened out, I’ll take another stab at writing a version that works today (though again, performance may be less than 100% ideal). And I have a rough idea of a second version that, while a bit hinky, might get us performance closer to what we’d want as well.

-Brad

I am using the web interface so discourse does not do naughty things to what I write.

Here are two routines I have where the last line of each is a comment which contains how I think I need to do the partial reduction, which comes from a suggestion of yours anyway. So between us we could both be wrong.

// these will have to do until partial reductions work

proc fudge(h : ?R, const ref x : [] R, const ref z : [?zD] R)
{
	const (rows, columns) = zD.dims();
	param fakeit = false;

	if fakeit then
	{
		var y : [columns] R = 0:R;

		for (r, c) in zD do
		{
			y[c] += x[r] * z[r, c];
		}
		return y / h;
	}
	else
	{
		return [j in columns] (vmDot(rows, z[.., j], x) / h);
	}

	// return [j in columns] (+ reduce [i in rows] (x[i] * z[i, ..])) / h;
}

proc fudge(h : ?R, const ref x : [] R, const ref z : [?zD] R, g : R)
{
	const (rows, columns) = zD.dims();

	return [j in columns] ((vmDot(rows, z[.., j], x) / h) / g);

	// return (([j in columns] (+ reduce [i in rows] (x[i] * z[i, ..]))) / h) / g;
}

If you look at the line above, that commented partial reduction, you see a line which produces a 1D array

[j in columns] (expression with an inner product using my routine vmDot)

Unfortunately, that code (which works) accesses the 2D array by columns which is the wrong way round in terms of optimal cache handling. I need to access the 2D array by rows.

As long as I can do this, I will write the Chapel code any way which works as long as I can access the 2D array by rows. If my attempt using a partial-reduction-like is the wrong way, I can change it. I just need to compute the numbers to compute a new 1D array with a dimension with a range of columns.

In the first routine, you will see (inactive code) showing how one can compute the 1D array using access by rows which is cache-friendly.

Hi Damian —

Sorry for the delay getting back to this thread. Here's how I'd propose you write this routine in Chapel as it stands today, with a reasonable combination of safety, elegance, and performance:

proc fudge(h : ?R, const ref x : [] R, const ref z : [?zD] R, param approach) {
  const (rows, columns) = zD.dims();
  var y : [columns] R;

  forall (r,c) in zD with (+ reduce y) do
    y[c] += x[r] * z[r, c];

  y /= h;

  return y;
}

The approach here is roughly:

  • use a forall so that the 2D array can be iterated over in parallel and in the way that the array author thinks is best (e.g., cache friendly)
  • use a with-clause on the forall to give each task its own copy of the result vector y so that (a) there are no race conditions due to trying to update a shared vector in parallel, and (b) there's no battling over the cache line that a single shared vector would live on

I think this solution is also attractive in that it's very similar to what I expect we will write ourselves for this case once we've actually finished adding our partial reduction support.

I arrived at this solution by writing a number of variations, some legal, some intentionally illegal, and timing them:

use Time;

// Different ways of writing this loop                                          
enum approaches {ser, naiveForall, naiveForallExpr, taskPrivPar, atomicPar};
use approaches;

// matrix size and what optional operations to do                               
config const m = 3, n = 5;
config const printArrays = false, verify=true;

// define our matrix and vector                                                 
var A: [1..m, 1..n] int;
var v: [1..m] int;

// fill them with silly, but unique values                                      
forall (i,j) in A.domain do
  A[i,j] = i*1000000 + j;

for i in 1..m do
  v[i] = i;

// create a serial reference result vector                                      
var r = fudge(1, v, A, ser);
if printArrays then
  writeln("serial result is:\n", r);

// test the other approaches                                                    
testApproach(naiveForall);
testApproach(naiveForallExpr);
testApproach(taskPrivPar);
testApproach(atomicPar);

// to test an approach, we'll time it, optionally print the array,              
// and then make sure it matches our reference result                           
proc testApproach(param approach) {
  var t: Timer;
  t.start();
  var s = fudge(1, v, A, approach);
  t.stop();
  writeln("approach ", approach, " took: ", t.elapsed());

  if printArrays then
    writeln("parallel result using ", approach, " is:\n", s);

  if verify {
    forall i in r.domain do
      if r[i] != s[i] then
        writeln("Mismatch at element ", i, " using ", approach, ": ",
                r[i], " != ", s[i]);
  } else {
    if approach == naiveForall || approach == naiveForallExpr then
      writeln("NOTE: ", approach, " isn't guaranteed to give correct results");
  }
}


// Here's the implementation of the five approaches                             
proc fudge(h : ?R, const ref x : [] R, const ref z : [?zD] R, param approach) {
  const (rows, columns) = zD.dims();
  var y : [columns] R;

  select approach {

    // serial: this is what you sent                                            
    when ser {
      for (r, c) in zD do
        y[c] += x[r] * z[r, c];
    }

    // naive forall loop: this isn't actually correct in general                
    // because if the 2D iteration spaces is chunked up by rows                 
    // (which is what Chapel typically prefers), multiple tasks                 
    // may try to update the same column location of 'y'                        
    // simultaneously.  So why did I write it?  To see how much                 
    // faster it would go...                                                    
    when naiveForall {
      forall (r,c) in zD do
        y[c] += x[r] * z[r, c];
    }

    // This is essentially the same as the previous, simply written             
    // shorthand.  Like the previous, it isn't actually correct in              
    // general.                                                                 
    when naiveForallExpr {
      [(r,c) in zD] y[c] += x[r] * z[r, c];
    }

    // This is the solution I picked                                            
    when taskPrivPar {
      forall (r,c) in zD with (+ reduce y) do
        y[c] += x[r] * z[r, c];
    }

    // This is a variation on the previous in which we use a shared             
    // atomic result vector rather than a shared race-y array or a              
    // private per task vector.  The downside is that each '+'                  
    // operation becomes a more expensive atomic '+', along with the            
    // cache downsides of sharing a result vector.                              
    when atomicPar {
      var yAt: [columns] atomic R;
      forall (r,c) in zD do
        yAt[c].add(x[r] * z[r, c]);
      y = yAt.read();
    }

    otherwise {
      compilerError("Unsupported algorithm");
    }
  }

  // scale the result vector                                                    
  y /= h;

  // return the result vector                                                   
  return y;
}

What was most intruiging about this experiment was that I expected the two illegal versions to be the fastest (because of the lack of need for any synchronization), yet incorrect (for the same reason). However, on my laptop at least, they were not only wrong, but also not the fastest (presumably because of the need for multiple tasks to battle over the shared vector in the cache?). For example:

$ ./testit --m=30000 --n=30000 --verify=false
approach naiveForall took: 0.478617
NOTE: naiveForall isn't guaranteed to give correct results
approach naiveForallExpr took: 0.513474
NOTE: naiveForallExpr isn't guaranteed to give correct results
approach taskPrivPar took: 0.448558
approach atomicPar took: 2.1744

There is a part of this conversation that remains somewhat mysterious to me however, though I feel I have to trust that you know what you're doing. Specifically, when I think of linear algebra and matrix-vector multiplications, I imagine the computation to be in this slightly different form:

// x would be of the form:  var x: [cols] R;
  var y : [rows] R;

  forall (r,c) in zD with (+ reduce y) do
    y[r] += z[r, c] * x[c];

to match the mathematical y = Z*x; yet your computation is more like the transpose of this. Is there a good intuition to get me out of this mental trough and on-board with what you've written?

I mention this because writing the naive versions of my form results in code that is typically legal in Chapel by default as long as the number of columns exceeds the number of tasks (cores). The reason for this is because when we parallelize by chunking up the rows, the result is that each task has a unique sub-vector of the solution to write to (and this is the case that motivates partial reductions from a performance perspective more than yours—since the implementation of that partial reduction could check for the typically caveat above and fall back on the more expensive solution if need be).

Anyway, I hope this solution might be helpful and useful, at least until we get a chance to return to partial reductions.

Thanks,
-Brad

Thanks for this. I will try it early next month when I am less swamped
with administrivia.

I notice you do not this:

var y : [columns] R = 0:R

but instead just do

var y : [columns] R;

Can one really assume that it will always be initialized to zero?

Also remember that the key performance benchmark is that in serial mode,
i.e. wrapped in a serial block, the Chapel code is competitive with some
Fortran 77 implementation, i.e. serial Fortran.

Stay safe - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Hi Damian —

Yep! Chapel ensures that all variables are default initialized unless the programmer takes pains to avoid it. And for numerical types, the default values are 0.

I have not done that comparison, but will be curious if you get a chance to try it out. Keep in mind that parallel Chapel code within serial blocks still has overheads, so that's not quite apples-to-apples. For example, we'd still create numCores tasks, give them each their own task-local copy of the vector, and reduce those results—simply running those tasks serially rather than in parallel. This is obviously overkill for a serial code, but we haven't spent much time optimizing the behavior of code within serial blocks (in practice we find they aren't used much for "real codes" and tend to be more interested in parallel-to-parallel performance comparisons).

-Brad

Hi Damian ?

  [40.png] damianmoz:

  Can one really assume that it will always be initialized to
  zero?

Yep! Chapel ensures that all variables are default initialized unless the
programmer takes pains to avoid it. And for numerical types, the default
values are 0.

Code that I write assuming that concept would not pass my organization's
programming guidelines. So I need to initialize the data. Otherwise my
staff will complain that the boss does not follow his own rules!!!!

  [40.png] damianmoz:

  Also remember that the key performance benchmark is that in serial
  mode, i.e. wrapped in a serial block, the Chapel code is
  competitive with some Fortran 77 implementation, i.e. serial
  Fortran.

I have not done that comparison, but will be curious if you get a chance
to try it out.

Done. Details are in my yet-to-be-accepted CHIUW 2021 paper.

Keep in mind that parallel Chapel code within serial
blocks still has overheads, so that's not quite apples-to-apples.

I compared

serial
{
	// calls routines using 'forall'

}

against

// serial
{
	// manually replace ALL every 'forall' with 'for'
}

At least for the example I have very heavily tested, a Singular Value
Decomposition, the difference was peanuts. So I now activate/deactivate
the the 'serial' clause with a boolean expression.

serial s == 'y' // 's' is a config const set to 'y' for serial mode

In reference to your final thoughts, I cannot see the point of optimizing
the parallel constructs within a serial block.

Thanks - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Young Brad,

proc fudge(h : ?R, const ref x : [] R, const ref z : [?zD] R, param approach)
{
const (rows, columns) = zD.dims();
var y : [columns] R;

forall (r,c) in zD with (+ reduce y) do
y[c] += x[r] * z[r, c];

y /= h;

return y;
}

Nice. That looks like it was way beyond my skill set.

The approach here is roughly:

  • use a forall so that the 2D array can be iterated over in parallel and
    in the way that the array author thinks is best (e.g., cache friendly)
  • use a with-clause on the forall to give each task its own copy of the
    result vector y so that (a) there are no race conditions due to trying
    to update a shared vector in parallel, and (b) there's no battling over
    the cache line that a single shared vector would live on

I need to understand the 'with' clause better but it looks good to me.

When you say that each task has its own copy of the result vector 'y',
does that mean that there is a merge by summation of those copies at the
end for the 'forall' implied by the

with (+ reduce y)

Also, what is the difference between

return y / h

and

y /= h;

return y;

Just curious.

Thanks for your help - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Hi Damian -

In general, when multiple tasks are created by a Chapel concept like forall or coforall, outer variables are "passed" to those tasks in a way that's dependent on the type (similar to default intents for arguments to Chapel procedures). For example, integers result in a per-task const copy of the int to prevent obvious race conditions. So in general, with clauses are a way to override these defaults and relay outer variables to tasks using an explicit strategy.

In this case, the + reduce strategy specifies that each task should get its own private copy of the variable, and that they should be combined with the original variable using + as the tasks complete.

Correct.

These compute the same computation, but I rewrote it this way thinking that by returning local variable y, I might enable the compiler to "steal" its value back to the s at the callsite to fudge() rather than copying the whole array out. In contrast, return y / h; suggests to me that the compiler will have to insert a new array to capture the result of y / h and then steal that array back. I didn't time this, though, and it's probably in the wash compared to the O(n**2) matrix-vector multiplication, so if you prefer the other style, I wouldn't feel the need to switch.

-Brad

Maybe the boss just needs to relax his rules when using a productive language. No sense in carrying all of your baggage forward with you. :slight_smile:

-Brad

My experience is that default initialization makes formal verification too
hard. I have yet to see a case where it makes sense in any language except
where the default initialization of an array is much faster than can be
expressed within that language. But that is a flaw in the language.

And with the ability to delay defining a variable until the point where it
is first used which has been around by design in C++ and Chapel, there is
even less need for default initialization.

Assuming an array is initially zero is generally an error. Even with a
scalar, it is an error more often than not. I find the '--undef' flag in
Fortran picks up bugs every time. But my experience is by definition
limited.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

This needs to be reformatted for Markdown.

At the risk of stealing the thunder from my paper, SVD has 4 phases:

H	Householder reduction
R	Compute Right Factor Matrix
L	Compute Left Factor Matrix
Q	QR Iteration

Below are timings for 2 separate runs for a serial Fortran implementation,
a Chapel implementation with Brad's fudged partial reduction in serial
mode, and my own cache-unfriendly inner product computation of that same
reduction which is also in serial mode.

Where the partial reduction is the dominant floating point operation, Brad's
fudged partial reduction is only just over 5% slower than the best Fortran
can do in the computation of the Factor Matrices.

Until Chapel can do decent vectorization or provides a mechanism where this
can be done by the programmer, the 'H' stage in Chapel will always lag that
of Fortran.

My total rework of the QR iteration allows it to be done with row-dominant
matrix algebra. This gets the QR stage to only an average of 17% slower
than Fortran and is my contribution to mankind!

Fortran:

H   3.70922899
R   1.92854881
L   1.93454218
Q   8.03622437

H   3.59914017
R   1.86908984
L   1.87104893
Q   7.82406759

Brad's fudged partial reduction:

H done 4.33814
R done 1.92019	fudged partial reduction dominant operation
L done 1.96702	ditto
Q done 9.14341

H done 4.34481
R done 1.91657
L done 1.96837
Q done 9.19123

Damian's cache-unfriendly method

H done 7.22981
R done 4.86092
L done 4.86309
Q done 9.1419

H done 7.22735
R done 4.8409
L done 4.86019
Q done 9.11394

Apologies for the lousy layout.

It looks like I can trim some words from my draft paper because Brad's
fudged partial reduction makes Chapel totally competitive with Fortran.

  • Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

1 Like
var y : [columns] R = 0:R

[ (r,c) in zD with (+ reduce y) ] y[c] += x[r] * z[r, c];

y /= h;

return y;

In the end, I went with the above. Nice and tight. Very concise. Clear.

I played with Markdown embedded in the email. Will see how it turned out

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer