Nested foralls fail with LLVM back-end in over-relaxation solution of Laplace's equation

Hello: I have just noticed that nesting two foralls in a very naive implementation of the over-relaxation solution of Laplace's equation fails if the compiler's back-end is LLVM. In the code below, compiling with chpl --fast laplace-sor-p.chpl and running fails; compiling with chpl --no-llvm --fast laplace-sor-p.chpl works as expected.

I know that a better algorithm exists that updates the u's on a "checkerboard", so that all new (say "black") values are updated using only old (say "white") values, instead of the completely unordered updates of the code below. I intend to verify whether that version passes through LLVM without problems.

I am not claiming this is a compiler bug, since the code below is quite an aggressive and thoughtless (naive) implementation, but it may be of interest to know that the gcc and LLVM back-ends produce different results for exactly the same source code.

Cheers

Nelson

// ===================================================================
// ==> laplace-sor-p: brute force (forall) parallelization of
// over-relaxation with Laplace's equation
// ===================================================================
use Time only stopwatch;
config const epsilon = 1.0e-8;     // accuracy
config const omega = 1.95;         // relaxation factor
config const Nn = 128;             // number of internal points 
const dl = 1.0/Nn;                 // delta l
writef("# Nn = %9i\n", Nn);
writef("# dl = %9.4dr\n", dl);
var u: [0..Nn,0..Nn] real = 0.0;   // iterative solution
for i in 0..Nn {                   // the initial guess
   for j in 0..Nn {
      u[i,j] = IG(i,j);
   }
}
var runtime: stopwatch;
var deltam = epsilon;              // the norm
var nc = 0;                        // # of iterations till convergence
runtime.start();
while deltam >= epsilon {          // change to parallel
   deltam = 0.0;
   var deltaxm: [1..Nn-1] real;    // mean |update| for each row i
   forall i in 1..Nn-1 {
      var deltaym: [1..Nn-1] real; // |update| for each j in row i
      forall j in 1..Nn-1 {
         var uavg = (u[i+1,j]+u[i-1,j]+u[i,j-1]+u[i,j+1])/4.0;
         var deltau = omega*(uavg - u[i,j]);
         u[i,j] += deltau;
         deltau = abs(deltau);
         deltaym[j] = deltau;
      }
      deltaxm[i] = (+ reduce deltaym)/(Nn-1);
   }
   deltam = (+ reduce deltaxm)/(Nn-1);
   writeln("deltam = ", deltam);
   nc += 1;
}                                  // done
runtime.stop();
writeln("nc = ",nc);
var sum = 0.0;
var mad = 0.0;
for i in 0..Nn {
   for j in 0..Nn {
      sum += u[i,j];
      var del = abs(u[i,j] - ana(i,j));
      mad += del;
   }
}
mad /= ((Nn+1)**2);
writeln("uavg = ",sum/((Nn+1)**2));
writeln("mad = ",mad);
writeln("elapsed time = ",runtime.elapsed()," s");
// -------------------------------------------------------------------
// --> initial guess
// -------------------------------------------------------------------
private inline proc IG(i: int, j: int): real {
   const x = i*dl;
   const y = j*dl;
   if i == 0 {
      return y;
   }
   else if i == Nn {
      return 1.0;
   }
   else if j == 0 {
      return x;
   }
   else if j == Nn {
      return 1.0;
   }
   else {
      return 0.50;
   }
}
// -------------------------------------------------------------------
// --> the analytical solution
// -------------------------------------------------------------------
private inline proc ana(i: int, j: int): real {
   const x = i*dl;
   const y = j*dl;
   return x + y - x*y;
}

Hi Nelson —

Can you clarify what you mean by "fails" here? I was imagining some sort of execution-time error or core dump, but with Chapel 1.29.0, the code seemingly runs to completion, though I have no idea whether the output is correct or not. E.g., see Run – Attempt This Online, which uses the LLVM back-end.

Also, what version of Chapel are you using?

My initial guess was that using --fast might be suppressing some other problem, like an out-of-bounds array access, but compiling without --fast, I'm not seeing any such errors. My next guess would be that there is a race in the expression of the algorithm itself that is leading to some sort of failure. But tell me more about what you're experiencing before we look further here.

Thanks,
-Brad

Hi Nelson —

Taking another look with 1.30 to make sure the behavior hasn't changed, I'm noticing now the infs and nans in the --fast output, which may be what you mean by "fails"? If so, it looks to me as though there's a race in your code, in that the inner parallel loop is both reading and writing u in an uncoordinated, stencil-oriented way. So my guess would be that the difference in behavior is due to different timing characteristics of the generated code changing the likelihood of the race occurring.

To make sure the race is clear: if I own the j==5 iteration, I read u[i+1, 5], u[i-1, 5], u[i, 4], and u[i, 6], and then go on to write u[i, 5]. If you own the j==6 iteration, one of the things you'll read is u[i, 5], which I'll be writing. So the value you read will depend on whether you are scheduled before or after me in the parallel iteration of the loop.
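Here's a minimal stand-alone sketch of that pattern (hypothetical names, not your code): each iteration of the forall writes a[i] while neighboring iterations read it, so the final contents of a can differ from run to run:

// hypothetical stand-alone example, not from the program above
config const n = 1000;
var a: [0..n+1] real;
forall i in 0..n+1 do a[i] = i*i;  // any non-harmonic initial data
forall i in 1..n {
   // a[i-1] and a[i+1] may be written concurrently by other iterations,
   // so what this reads depends on task scheduling
   a[i] = (a[i-1] + a[i+1])/2.0;
}
writeln(+ reduce a);               // can vary from run to run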

Let me know if I'm mistaken and missing something more subtle that would guard against that.

-Brad

Hi Brad:

[quote="bradcray, post:3, topic:20559"]

Taking another look with 1.30 to make sure the behavior hasn't changed, I'm noticing now the infs and nans in the --fast output, which may be what you mean by "fails"?

Exactly. The variable deltam keeps increasing (it should decrease) until nans and infs appear. Apparently, the outermost while returns false once deltam is no longer a valid floating-point number. The program stops, but the "solution" is all nans ...
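That behavior makes sense in hindsight, since any ordered comparison against a nan is false; here is a quick stand-alone check (not part of the program above):

var zero = 0.0;
var d = zero/zero;        // real division: produces nan, no runtime error
writeln(d >= 1.0e-8);     // false: ordered comparisons with nan are false

So once deltam becomes nan, the while condition fails and the loop exits with a garbage "solution".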


[quote="bradcray, post:3, topic:20559"]
If so, it looks to me as though there's a race in your code, in that the inner parallel loop is both reading and writing u in an uncoordinated, stencil-oriented way. So my guess would be that the difference in behavior is due to different timing characteristics of the generated code changing the likelihood of the race occurring.
[/quote]

Again, probably yes.


[quote="bradcray, post:3, topic:20559"]
To make sure the race is clear: if I own the j==5 iteration, I read u[i+1, 5], u[i-1, 5], u[i, 4], and u[i, 6], and then go on to write u[i, 5]. If you own the j==6 iteration, one of the things you'll read is u[i, 5], which I'll be writing. So the value you read will depend on whether you are scheduled before or after me in the parallel iteration of the loop.
[/quote]

Yes. So my guess is that the gcc- and llvm-generated code end up interleaving the accesses to the u[i,j]s differently. Maybe it is just by chance that gcc works and llvm does not. The way I programmed it, I was aware that the order in which the u[i,j]s are updated and accessed is essentially arbitrary, but for serial algorithms this is still taught (in old references by now) as a valid approach. A very readable reference is

Hansen, P. B. Numerical Solution of Laplace's Equation. Syracuse University, College of Engineering and Computer Science, 1992.

And this "brute force no matter which [i,j] comes first" is the Gauss-Seidel method (algorithm 6): I can send you the pdf if it is hard to find.


[quote="bradcray, post:3, topic:20559"]
Let me know if I'm mistaken and missing something more subtle that would guard against that.
[/quote]

The upshot is that you are exactly right. As I mentioned, it may be asking too much for a brute-force parallelization of Gauss-Seidel to work, because so many u[i,j]s are updated simultaneously, whereas in the serial version only one is updated at a time.

A more careful (and smarter) algorithm is to update according to "parity" (black or white squares on a chessboard). This ensures that only "old" (black) values are used to update the "new" (white) ones, while still keeping just one 2D array u in memory. This is Algorithm 8 of the reference above, and I have now tested it in both the serial and parallel (two nested foralls) versions. With this "parity ordering", no race condition occurs and the parallel version works fine.
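Roughly, one iteration of the parity-ordered version looks like this (a sketch of the idea, reusing u, Nn, and omega from above, not my exact tested code):

// one iteration of the parity-ordered sweep: two half-sweeps
for parity in 0..1 {
   forall i in 1..Nn-1 {
      forall j in 1..Nn-1 {
         if (i + j) % 2 == parity {  // update only one "color" per sweep
            const uavg = (u[i+1,j] + u[i-1,j] + u[i,j-1] + u[i,j+1])/4.0;
            u[i,j] += omega*(uavg - u[i,j]);
         }
      }
   }
}

Each half-sweep writes only points of one parity and reads only points of the other, so no forall iteration can read a value that another is writing. (One could also stride j by 2 with a parity-dependent offset to avoid the test.)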

Regards (and sorry for the boldface, which seems to have been generated by the "-------" lines)

Nelson

The code generated by both back-ends should be roughly similar (in terms of number of tasks, who does what work, etc.), but the quality or specifics of the generated code could differ, which could result in different timing characteristics. Notably, removing --fast (which changes both the generated code and its timing characteristics) changes the results as well.

I don't have time at the moment to familiarize myself with the algorithm and the degree to which it would or would not be safe to parallelize in this way in Chapel, but let me see whether someone else on the team might.

-Brad

OK! And by the way, I can confirm that compiling with LLVM but without --fast produces the correct result. I am using 1.29.

regards,

Nelson

Hi Nelson!

[quote]
I know that a better algorithm exists that updates the u's on a "checkerboard", so that all new (say "black") values are updated using only old (say "white") values, instead of the completely unordered updates of the code below. I intend to verify whether that version passes through LLVM without problems.
[/quote]

I'm not sure whether this will be helpful to you, but a little while ago I posted an example of some Chapel code that uses the Checkerboard algorithm you're talking about on another thread. See this message: Code optimization - #24 by jcorrado

That code is part of a 3D Poisson solver that uses SOR, so it might be similar enough to be a useful reference for you.

Please let me know if you have any questions.

Best,
Jeremiah

Many thanks Jeremiah! I will take a good look.

best

Nelson
