Code optimization

The computations done in sor_method of your Fortran code are serial, not parallel. This is not least because the recursive nature of the SOR formula is non-trivial to parallelise using simple concepts.

The forall version of your sor_method routine produces numbers that disagree with those produced by the serial Fortran code or the serial Chapel code. That is because the latter two evaluate the recursive statements in the correct order. While the numbers produced in parallel with a forall are wrong, when you use a convergence tolerance erro of 1.0e-6 in the code, over 95% of the data is within sqrt(1.0e-6) of the correct values. But that does not change the fact that they are still wrong.
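
To make the dependence concrete, here is a minimal 1D sketch of the issue (Gauss-Seidel, i.e. SOR with w = 1, to keep it short; the names u, rhs and h are purely illustrative, not taken from your code):

    // each Gauss-Seidel/SOR update reads the u(i-1) produced earlier in the *same*
    // sweep, so the sweep must be serial; replacing 'for' with 'forall' lets the
    // u(i-1)/u(i+1) reads race with other updates and the results become order-dependent
    proc gaussSeidelSweep(ref u: [?D] real, const ref rhs: [D] real, h: real) {
        for i in D.expand(-1) do                        // serial: u(i-1) is already new
            u(i) = 0.5 * (u(i-1) + u(i+1) - h*h*rhs(i));
    }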

So if you are trying to do justice to your parallel Chapel code when comparing it to your original MPI (parallel) Fortran, you need to avoid a forall in sor_method. Separately, I note that the run-time for your Chapel code includes the time to write the results to a file, but the run-time for your Fortran code does NOT include that time. The Fortran code times only the computations. That means for the same computations, your Chapel run-time appears much slower than it really is. You should move the statement

runtime.stop()

so that it comes before the call to the routine escreve, which I assume is Portuguese for write.
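
In other words, as a sketch only (using the names from your Chapel code), the timed region should end before any file output:

    runtime.start();
    while erro >= tol {
        // ... SOR sweeps and the residual check only ...
    }
    runtime.stop();                               // computation time only
    escreve(f, imax, jmax, kmax, dx, dy, dz);     // file I/O is no longer timed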

I can point you at suggestions on how to code the finite-difference-based SOR method in a parallel fashion. However, there are probably others in the Chapel community, with far more expertise than me, who might like to make some enlightened suggestions.

Trying to write a parallel sor_method routine will be far easier in Chapel than in Fortran+MPI. My 2c.

Hello @damianmoz and @annacfs, wrt parallelizing the SOR method, one possible option is to use a "checker-board" approach as described on this page: CS267: Notes for Lectures 15 and 16, Mar 5 and 7, 1996.

The basic idea is to split your array into two groups of non-adjacent elements (in 2D, this ends up looking like the squares of a checkerboard). Then, the new values of each group can be computed in parallel, because they don't rely on the values of the neighboring elements.

I've put together a modified version of your code that uses this approach. It also makes some small adjustments to the array initialization, and simplifies the calc_erro proc by doing a maximum reduction at the same time as the computation.

I can't be 100% sure that this will match results from your original Fortran code, but hopefully it's a helpful start (at least conceptually) for parallelizing that loop:


use Time;
var runtime: Timer;

// set up domains
config const imax, jmax, kmax = 10;
const domPoisson = {1..imax, 1..jmax, 1..kmax},     // the full domain
      dpInner = domPoisson[domPoisson.expand(-1)];  // the full domain, not including boundaries

// set up arrays
var f, tf : [domPoisson] real = 0.0;

// set up other parameters
const Lx = 1.0, Ly = 1.0,  Lz = 1.0 : real;
const dx = Lx/imax, dy = Ly/jmax, dz = Lz/kmax : real;
const w  = 1.3 : real;
const dx2 = dx*dx, dy2 = dy*dy, dz2 = dz*dz: real;
const beta1 = dx2/dy2,
      beta2 = dx2/dz2 : real;
const tol = 1E-6: real;
const NPRIN = 500;

var i, j, k: int;
var itera: int = 0;
var erro : real = 1E0;
var max_f, min_f : real;

// create arrays of x,y,z coordinates so that we don't have to recompute them during iteration:
const x_coords = [i in domPoisson.dim(0)] (i - 0.5) * dx,
      y_coords = [j in domPoisson.dim(1)] (j - 0.5) * dy,
      z_coords = [k in domPoisson.dim(2)] (k - 0.5) * dz;

// initialize source term:
forall (i, j, k) in domPoisson do
    tf[i, j, k] = -3.0 * sin(x_coords[i]) * cos(y_coords[j]) * sin(z_coords[k]);

// initialize boundary conditions:
forall j in domPoisson.dim(1) {
    for k in domPoisson.dim(2) {
        f(1, j, k) = sin(0.5) * cos(y_coords[j]) * sin(z_coords[k]);
        f(imax, j, k) = sin(x_coords[imax]) * cos(y_coords[j]) * sin(z_coords[k]);
    }
}

forall i in domPoisson.dim(0) {
    for k in domPoisson.dim(2) {
        f(i, 1, k) = sin(x_coords[i]) * cos(0.5) * sin(z_coords[k]);
        f(i, jmax, k) = sin(x_coords[i]) * cos(y_coords[jmax]) * sin(z_coords[k]);
    }
}

forall i in domPoisson.dim(0) {
    for j in domPoisson.dim(1) {
        f(i, j, 1) = sin(x_coords[i]) * cos(y_coords[j]) * sin(0.5);
        f(i, j, kmax) = sin(x_coords[i]) * cos(y_coords[j]) * sin(z_coords[kmax]);
    }
}

// temporary copy of 'f'
var f_next = f;

runtime.start();
while erro  >= tol {
    // compute 'f_next'
    sor_parallel(f_next, f, tf, dx2, beta1, beta2, w);

    // swap 'f_next' into 'f' for the next iteration
    f <=> f_next;

    // compute the error residual
    erro = calc_erro(f, tf, dx2, dy2, dz2);

    itera += 1;
    if (mod(itera,NPRIN) == 0) {
        writef('%i %6.12dr\n', itera, erro);
    }
}
runtime.stop();

escreve(f, imax,  jmax, kmax, dx, dy, dz);

max_f = max reduce(f);
min_f = min reduce(f);

writef("%12.8er   %12.8er \n", max_f, min_f);
writef(" CPU TIME CHAPEL  = %10.16dr \n", runtime.elapsed());
writef("%i   %12.8er \n",itera, erro);
writef("------------------------------------------------------------------------------ \n");

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

proc sor_parallel(ref f_next, const ref f, const ref tf, dx2: real, beta1: real, beta2: real, w: real) {

    // update the "red spaces" in the checkerboard domain in parallel
    forall (i, j, k) in dpInner {
        if ((i+j)%2 == 0 && k%2 == 0) || ((i+j)%2 != 0 && k%2 != 0) {
            f_next(i,j,k) = (1.0 - w) * f(i, j, k) + w *
                    (-dx2*tf(i,j,k) + f(i-1,j,k) + f(i+1,j,k) +
                    beta1*( f(i,j-1,k) + f(i,j+1,k) ) +
                    beta2*( f(i,j,k-1) + f(i,j,k+1) )) /
                    ( 2.0*(beta1 + beta2 + 1.0) );
        }
    }

    // update the "black spaces" in the checkerboard domain in parallel
    forall (i, j, k) in dpInner {
        if ((i+j)%2 != 0 && k%2 == 0) || ((i+j)%2 == 0 && k%2 != 0) {
            f_next(i,j,k) = (1.0 - w) * f(i, j, k) + w *
                    (-dx2*tf(i,j,k) + f_next(i-1,j,k) + f_next(i+1,j,k) +
                    beta1*( f_next(i,j-1,k) + f_next(i,j+1,k) ) +
                    beta2*( f_next(i,j,k-1) + f_next(i,j,k+1) )) /
                    ( 2.0*(beta1 + beta2 + 1.0) );
        }
    }
}

proc calc_erro(f, tf, dx2: real, dy2: real, dz2: real): real {
    // do a max-reduction during error calculation,
    //      so that results don't have to be stored in 'residuo'
    var max_error: real;
    forall (i, j, k) in dpInner with (max reduce max_error) {
        max_error reduce= abs(tf(i,j,k) -
            (f(i-1,j,k) - 2.0*f(i,j,k) + f(i+1,j,k))/dx2 -
            (f(i,j-1,k) - 2.0*f(i,j,k) + f(i,j+1,k))/dy2 -
            (f(i,j,k-1) - 2.0*f(i,j,k) + f(i,j,k+1))/dz2);
    }
    return max_error;
}

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

proc escreve(f, imax: int, jmax: int, kmax: int, dx: real, dy: real, dz: real){
    use IO;
	var erro: real;
    const saida = openwriter("saida_chapel_par.csv");

    saida.writef("  %s  %s   %s  %s  %s  %s  %s  %s %s \n",  "x", ",", "y", ",", "z", ",", "f", ",", "e");
    for (i, j, k) in domPoisson {
        erro = abs( f(i, j, k) - (sin(x_coords[i]) * cos(y_coords[j]) * sin(z_coords[k])) );
        saida.writef("%10.12dr  %s %10.12dr   %s  %10.12dr  %s  %10.12dr  %s  %10.12dr  \n",
        		            x_coords[i], ",", y_coords[j], ",", z_coords[k], ",", f(i,j,k), ",", erro);
    }

    saida.close();
}

Can you please point me at the documentation that tells me how to reproduce your use of reduce? Or explain what you are doing? That looks pretty cool and I want to learn such advanced usage.

@damianmoz - these are called "reduce intents". The use of reduce= is optional and can be replaced by the desired operation, like max() in this case.

Please look for 'reduce=' on these pages:
https://chapel-lang.org/docs/main/primers/forallLoops.html
https://chapel-lang.org/docs/main/language/spec/data-parallelism.html#forall-intents

Thanks for that reference, Vass. I am not sure I understand it. I might summarize all the examples I can see in the distribution and see if the explanation jumps out at me.

@damianmoz In case this helps, here is an example that computes four reductions over an array all at once, with the ability to add more per-element computations:

var xSum = 0,
    xProd = 1,
    xMax = min(int),
    xMin = max(int);

forall a in MyArray with (+ reduce xSum, * reduce xProd,
                          max reduce xMax, min reduce xMin)
{
  // whatever other computations you want
  xSum  reduce= a;
  xProd reduce= a;
  xMax  reduce= a;
  xMin  reduce= a;
}

writeln((xSum, xProd, xMax, xMin));

In experimenting with my enhanced knowledge of reduce=, courtesy of @vass and @jcorrado, while trying to write some extended documentation on its use, I took Jeremiah's (@jcorrado) approach of only creating the elements to be reduced, avoiding the creation of the residual matrix per se. With a minor improvement to the algebra, and avoiding reduce=, I get

	proc calc_erro(f, tf,  dx2: real, dy2: real, dz2: real)
	{
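		// max-norm of the discrete residual |laplacian(f) - tf| over the inner points,
		// computed as a reduce over a forall expression (no temporary residual array)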
		param one = 1.0;
		const rdx2 = one / dx2;
		const rdy2 = one / dy2;
		const rdz2 = one / dz2;
		const rxyz = 2.0 * (rdx2 + rdy2 + rdz2);

		return max reduce
		(
			forall (i,j,k) in dpInner do abs
			(
				((f[i,j,k-1] + f[i,j,k+1]))*rdz2 - tf[i,j,k] +
				((f[i,j-1,k] + f[i,j+1,k]))*rdy2 +
				((f[i-1,j,k] + f[i+1,j,k]))*rdx2 - rxyz * f[i,j,k]
			)
		);
	}

The performance matches the @jcorrado variant.

I avoided the red-black ordering and instead computed each diagonal wave in parallel; this gives exactly the same results as the serial case (still using the enhanced calc_erro routine). That said, apart from the gain from the improved calc_erro, this approach still hits a limit in the number of cores beyond which you see no further performance gain.
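
For reference, here is a minimal sketch of what I mean by the diagonal-wave (wavefront) ordering, written against the arrays used above; the proc name sor_wavefront and the exact loop structure are illustrative rather than the code I actually ran:

    proc sor_wavefront(ref f, const ref tf, dx2: real, beta1: real, beta2: real, w: real) {
        const inner = f.domain.expand(-1);
        const dlo = inner.dim(0).low  + inner.dim(1).low  + inner.dim(2).low;
        const dhi = inner.dim(0).high + inner.dim(1).high + inner.dim(2).high;

        // points on the plane i+j+k == d depend only on plane d-1 (already updated)
        // and plane d+1 (not yet updated), so each plane can be swept with a forall
        // while the planes themselves are visited serially, reproducing the serial
        // lexicographic ordering exactly
        for d in dlo..dhi {
            forall (i, j, k) in inner do
                if i + j + k == d then
                    f(i,j,k) = (1.0 - w) * f(i,j,k) + w *
                        (-dx2*tf(i,j,k) + f(i-1,j,k) + f(i+1,j,k) +
                        beta1*( f(i,j-1,k) + f(i,j+1,k) ) +
                        beta2*( f(i,j,k-1) + f(i,j,k+1) )) /
                        ( 2.0*(beta1 + beta2 + 1.0) );
        }
    }

Filtering the whole inner domain for every plane is simple but does redundant work, and the planes themselves are serialized, which is consistent with the core-count ceiling I mentioned.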

With the red-black ordering implemented, the maximum error converges in much the same number of cycles, but the individual errors are orders of magnitude worse. That said, the now highly parallelized performance not only rocks but also scales as one would hope. I just wish I fully understood why red-black ordering works as well as it does. But here is not the place to discuss the intricacies of the SOR method.

As I said, the parallelized red-black ordering of the SOR method in Chapel absolutely rocks. And it is easy to code.

I reworked sor_method to be

    proc sor_method(f, tf,  dx2: real, beta1: real,  beta2: real,  w: real)
    {
        param one = 1.0;
        param half = 0.5;
        const w1 = (one - w);
        const w2 = (half * w) / (one + beta1 + beta2);
        const innerIJK = f.domain.expand(-1);
        const g = f;

        // update the "red spaces" in the checkerboard domain in parallel

        forall (i, j, k) in innerIJK
        {
            if ((i+j)%2 == 0 && k%2 == 0) || ((i+j)%2 != 0 && k%2 != 0)
            {
                f(i,j,k) = w1 * g(i, j, k) + w2 *
                    (-dx2*tf(i,j,k) + g(i-1,j,k) + g(i+1,j,k) +
                    beta1*( g(i,j-1,k) + g(i,j+1,k) ) +
                    beta2*( g(i,j,k-1) + g(i,j,k+1) ));
            }
        }

        // update the "black spaces" in the checkerboard domain in parallel

        forall (i, j, k) in innerIJK
        {
            if ((i+j)%2 != 0 && k%2 == 0) || ((i+j)%2 == 0 && k%2 != 0)
            {
                f(i,j,k) = w1 * g(i, j, k) + w2 *
                    (-dx2*tf(i,j,k) + f(i-1,j,k) + f(i+1,j,k) +
                    beta1*( f(i,j-1,k) + f(i,j+1,k) ) +
                    beta2*( f(i,j,k-1) + f(i,j,k+1) ));
            }
        }
    }

This avoids the need for f_next in the main program, which also means that the swap

    // swap 'f_next' into 'f' for the next iteration
    f <=> f_next;

is now redundant. This interface demands that it be called within the main iterative loop as just

    sor_method(f, tf, dx2, beta1, beta2, w);

I also tidied up the algebra to avoid the division inside the loops, which saves a bit - about 3% on my system.