Code optimization

If vectorization is actually working well, the following might perform better:

    proc sor_kernel(_k, f, tf,  i: int, j: int,  dx2: real, beta1: real)
    {
        var b : [_k] real;
        const ref fijm1 = f[i, j-1, ..];
        const ref fijp1 = f[i, j+1, ..];
        const ref fim1j = f[i-1, j, ..];
        const ref fip1j = f[i+1, j, ..];
        const ref tfij = tf[i, j, ..];

        foreach k in _k do // this can be vectorized
        {
            // there are superscalar and FMA opportunities here
            // note: this ordering might be suboptimal, but here goes

            const b1 = fijm1[k] + fijp1[k];
            const b0 = fim1j[k] + fip1j[k];

            b[k] = b1 * beta1 + b0 - dx2 * tfij[k];
        }
        return b;
    }
    proc sor_method(f, tf,  dx2: real, beta1: real,  beta2: real,  w: real)
    {
        // recompute what is stored inside Inner to avoid
        // using data outside the scope of this routine
        const ijk = f.domain.dims();
        const _i = ijk[0].expand(-1); // 2..imax-1
        const _j = ijk[1].expand(-1); // 2..jmax-1
        const _k = ijk[2].expand(-1); // 2..kmax-1
        const w1 = (1.0 - w);
        const w2 = (0.5 * w) / (beta1 + beta2 + 1.0);

        forall i in _i do // if imax is small, this might be sub-optimal
        {
            // the entire loop contents will be run as a single task
            // so ensure that each such task has lots of work to do

            for j in _j do
            {
                ref fij = f[i, j, ..];
                var fijk = f[i, j, 1];
                const b = sor_kernel(_k, f, tf, i, j, dx2, beta1);

                for k in _k do // recursive - cannot be vectorized - sigh!!
                {
                    const b2 = fijk + fij[k+1];

                    fijk = w1 * fij[k] + w2 * (b2 * beta2 + b[k]);
                    fij[k] = fijk;
                }
            }
        }
    }

Try it. Experiment.

Nice!!!!! Let me think about it!
Thank you!

I have not tested this. I hope I have not lost anything in the cutting and pasting I did. Others may have some better ideas. I am not sure how well the current version of the compiler is at vectorizing the code, something about which @mppf might have way better insight than I. Are you using the LLVM- or GCC-based variant of the compiler? Again, others can probably better advise you here, but that choice will influence how well your code vectorizes. Maybe @jcorrado has some insight here as well?

What are representative values of the problem size, i.e. what are the values of imax, jmax and kmax?

See also the discussion on GitHub issue #19636, which is related.

I tested imax=jmax=kmax = 130 on the cluster. The size of the problem is on the order of 10^6, but my goal is to do 10^9.

Thanks. My idea of using the forall on the outside loop is OK then.

Note that if you ultimately want to run this on multiple nodes of a cluster, you might need to use coforall and tweak the design of sor_method. But that can wait for now. Get things running nicely on your single node for now. I assume that the single node you are using in the cluster has two 28-core CPUs, meaning you have 56 cores available to you.
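To give a feel for the shape of that eventual change, here is a very rough multi-node sketch only, not a drop-in replacement for sor_method: each locale (node) is handed a contiguous slab of i-planes, and a real version would also have to exchange the boundary planes between neighbouring slabs (which this sketch ignores).

    coforall loc in Locales do on loc
    {
        // split the inner i-planes 2..imax-1 into one contiguous slab per locale
        const planesPerLoc = (imax - 2 + numLocales - 1) / numLocales;
        const first = 2 + loc.id * planesPerLoc;
        const last  = min(first + planesPerLoc - 1, imax - 1);

        forall i in first..last
        {
            // ... the per-plane work currently inside sor_method ...
        }
    }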

Also, in my opinion, you need to turn multi-threading off on your system. We always get better performance with a single thread for each FPU. Others might have different experience using Chapel, and I suggest you get their advice.
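As a concrete way to experiment with that, Chapel lets you cap the number of tasks a forall uses either from the environment or from an execution-time flag, and here.maxTaskPar tells you what the runtime settled on. A minimal sketch (assuming the binary is called sor and the node has 24 physical cores):

    // run with, e.g.,
    //     CHPL_RT_NUM_THREADS_PER_LOCALE=24 ./sor
    // or
    //     ./sor --dataParTasksPerLocale=24
    // so that a forall uses one task per physical core rather than per hardware thread
    writeln("tasks per forall on this locale: ", here.maxTaskPar);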

You could/should also look at unrolling the innermost loops.

I would also be curious whether a foreach loop is faster than a for loop unrolled by 4. If the former is no faster, then the original suggestion, with its innermost loop unrolled by at least 2 (or maybe 4), would probably be optimal.

You should force imax, jmax, and kmax to be an odd multiple of 2. This means the number of elements in a range like

2..imax-1

is always a multiple of 4. This makes loop unrolling a lot easier to implement. My 2c.
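For illustration, a minimal unrolling sketch along those lines, using a stand-in array a purely to show the structure (with imax = 130, the range 2..imax-1 has 128 elements, so the stride of 4 divides it exactly):

    config const imax = 130;        // an odd multiple of 2, as suggested above
    var a : [1..imax] real;

    for k in 2..imax-1 by 4
    {
        // four independent statements per trip, which the compiler can
        // schedule and vectorize more aggressively than a one-statement body
        a[k]   = a[k]   + 1.0;
        a[k+1] = a[k+1] + 1.0;
        a[k+2] = a[k+2] + 1.0;
        a[k+3] = a[k+3] + 1.0;
    }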

I ran your task on a 24-core system. I expected a gain of about 11.

With the original code, I get
a) a serial performance of 690 seconds in 15000+ iterations
b) a parallel performance of 53 seconds in 22000+ iterations

The performance gain is about a factor of 13, which is slightly better than I thought. Do your numbers agree to within some tolerance? The parallelization of that recursive formula worries me.

When I run the modified version using the approach I suggested, I get
b) a parallel performance of 73 seconds in 15000+ iterations

The performance gain is about 9.5 which is still pretty good and the iteration count matches.

EDITED after the event

I rewrote it in a simpler style, still handling the recursive aspects of the loop over the 3rd dimension, and the performance gain is now nearly 22 on the 24-core machine (with a total of 8 memory channels) and nearly 28 on the 52-core machine (with a total of 12 memory channels). The figure of 28 is misleading because the serial version runs at 3.9 GHz if it is the only task on that CPU, whereas the parallel version with 52 cores runs at 2.1 GHz. If you scale that up, you get a factor of about 50 on a 52-core system. I will post the code tomorrow.

That said, I think there is some issue with the allocation of cores on your cluster.

Can you please provide the Fortran+MPI reference code?

What do your 2 graphs mean? What serial run time do you get?

When you say 28 cores, 56 threads, what are the actual CPUs, i.e.

cat /proc/cpuinfo

on the Linux machine. The machine I am using has two 12-core Xeon 2650 v4 CPUs, so a total of 24 cores (and, as I noted, only 24 threads, which avoids 2 separate threads fighting over access to a single FPU as there are only as many FPUs as there are cores).

I also ran it on a machine with two 18-core Xeon 2695 v4 CPUs, which has more cores, and there is no performance gain. The performance is somewhat dominated by memory access (or cache misses), so you quickly hit the limit. When I run it on two 26-core Xeon Gold 6230R CPUs with 50% more memory bandwidth, I see the run-times for the parallel version drop by nearly 50%.

Unblock 6 post limit


The computations done in sor_method of your Fortran code are serial, not parallel. This is not least because the recursive nature of the SOR formula is non-trivial to parallelise using simple concepts.

The internals of your forall version of the sor_method routine use a forall statement that produces numbers which disagree with those produced by the serial Fortran code or the serial Chapel code. That is because the latter two evaluate the recursive statements in the correct order. While the numbers produced in parallel using a forall are wrong, when you use a convergence tolerance of 1.0e-6 in the code, over 95% of the data is within sqrt(1.0e-6) of the correct value. But that does not change the fact that they are still wrong.
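A toy example (not the SOR code itself) shows why the ordering matters: each element depends on the value just written, so a forall over this loop would read stale values and produce different numbers.

    var a : [1..8] real = 1.0;
    a[1] = 2.0;

    for k in 2..8 do    // must stay serial: a[k] needs the a[k-1] computed just before it
        a[k] = 0.5 * (a[k] + a[k-1]);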

So if you are trying to do justice to your parallel Chapel code when comparing it to your original MPI (parallel) Fortran, you need to avoid a forall in sor_method. On that note, I note that the run-time for your Chapel code includes the time to write the results to a file, but the run-time for your Fortran code does NOT include that time. The Fortran code times only the computations. That means that, for the same computations, your Chapel run-time appears much slower than it really is. You should move the statement

runtime.stop()

to come before the routine escreve, which I assume is Portuguese for write.
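In other words, using the same names as in the code posted later in this thread, the end of the main program should look roughly like this:

    runtime.stop();                               // stop the timer before any file I/O
    escreve(f, imax, jmax, kmax, dx, dy, dz);     // write the results afterwards
    writef(" CPU TIME CHAPEL  = %10.16dr \n", runtime.elapsed());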

I can point you at suggestions on how to code the finite-difference-based SOR method in a parallel fashion. However, there are probably others in the Chapel community whose advice will be far superior to mine who might like to make some enlightened suggestions.

Trying to write a parallel sor_method routine will be far easier in Chapel than in Fortran+MPI. My 2c.

Hello @damianmoz and @annacfs, wrt parallelizing the SOR method, one possible option is to use a "checker-board" approach as described on this page: CS267: Notes for Lectures 15 and 16, Mar 5 and 7, 1996.

The basic idea is to split your array into two groups of non-adjacent elements (in 2D, this ends up looking like the squares of a checkerboard). Then, the new values of each group can be computed in parallel, because they don't rely on the values of the neighboring elements.
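Incidentally, in 3D the two groups are just the two parities of i+j+k (neighbouring points differ by 1 in exactly one index, so they always have the opposite parity). A tiny helper, not used in the code below but hopefully making the membership test concrete:

    proc isRed(i: int, j: int, k: int) : bool
    {
        return (i + j + k) % 2 == 0;    // the "black" points are the odd-parity ones
    }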

I've put together a modified version of your code that uses this approach. It also makes some small adjustments to the array initialization, and simplifies the calc_erro proc by doing a maximum reduction at the same time as the computation.

I can't be 100% sure that this will match results from your original Fortran code, but hopefully it's a helpful start (at least conceptually) for parallelizing that loop:


use Time;
var runtime: Timer;

// set up domains
config const imax, jmax, kmax = 10;
const domPoisson = {1..imax, 1..jmax, 1..kmax},     // the full domain
      dpInner = domPoisson[domPoisson.expand(-1)];  // the full domain, not including boundaries

// set up arrays
var f, tf : [domPoisson] real = 0.0;

// set up other parameters
const Lx = 1.0, Ly = 1.0,  Lz = 1.0 : real;
const dx = Lx/imax, dy = Ly/jmax, dz = Lz/kmax : real;
const w  = 1.3 : real;
const dx2 = dx*dx, dy2 = dy*dy, dz2 = dz*dz: real;
const beta1 = dx2/dy2,
      beta2 = dx2/dz2 : real;
const tol = 1E-6: real;
const NPRIN = 500;

var i, j, k: int;
var itera: int = 0;
var erro : real = 1E0;
var max_f, min_f : real;

// create arrays of x,y,z coordinates so that we don't have to recompute them during iteration:
const x_coords = [i in domPoisson.dim(0)] (i - 0.5) * dx,
      y_coords = [j in domPoisson.dim(1)] (j - 0.5) * dy,
      z_coords = [k in domPoisson.dim(2)] (k - 0.5) * dz;

// initialize source term:
forall (i, j, k) in domPoisson do
    tf[i, j, k] = -3.0 * sin(x_coords[i]) * cos(y_coords[j]) * sin(z_coords[k]);

// initialize boundary conditions:
forall j in domPoisson.dim(1) {
    for k in domPoisson.dim(2) {
        f(1, j, k) = sin(0.5) * cos(y_coords[j]) * sin(z_coords[k]);
        f(imax, j, k) = sin(x_coords[imax]) * cos(y_coords[j]) * sin(z_coords[k]);
    }
}

forall i in domPoisson.dim(0) {
    for k in domPoisson.dim(2) {
        f(i, 1, k) = sin(x_coords[i]) * cos(0.5) * sin(z_coords[k]);
        f(i, jmax, k) = sin(x_coords[i]) * cos(y_coords[jmax]) * sin(z_coords[k]);
    }
}

forall i in domPoisson.dim(0) {
    for j in domPoisson.dim(1) {
        f(i, j, 1) = sin(x_coords[1]) * cos(y_coords[j]) * sin(0.5);
        f(i, j, kmax) = sin(x_coords[1]) * cos(y_coords[j]) * sin(z_coords[kmax]);
    }
}

// temporary copy of 'f'
var f_next = f;

runtime.start();
while erro  >= tol {
    // compute 'f_next'
    sor_parallel(f_next, f, tf, dx2, beta1, beta2, w);

    // swap 'f_next' into 'f' for the next iteration
    f <=> f_next;

    // compute the error residual
    erro = calc_erro(f, tf, dx2, dy2, dz2);

    itera += 1;
    if (mod(itera,NPRIN) == 0) {
        writef('%i %6.12dr\n', itera, erro);
    }
}
runtime.stop();

escreve(f, imax,  jmax, kmax, dx, dy, dz);

max_f = max reduce(f);
min_f = min reduce(f);

writef("%12.8er   %12.8er \n", max_f, min_f);
writef(" CPU TIME CHAPEL  = %10.16dr \n", runtime.elapsed());
writef("%i   %12.8er \n",itera, erro);
writef("------------------------------------------------------------------------------ \n");

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

proc sor_parallel(ref f_next, const ref f, const ref tf, dx2: real, beta1: real, beta2: real, w: real) {

    // update the "red spaces" in the checkerboard domain in parallel
    forall (i, j, k) in dpInner {
        if ((i+j)%2 == 0 && k%2 == 0) || ((i+j)%2 != 0 && k%2 != 0) {
            f_next(i,j,k) = (1.0 - w) * f(i, j, k) + w *
                    (-dx2*tf(i,j,k) + f(i-1,j,k) + f(i+1,j,k) +
                    beta1*( f(i,j-1,k) + f(i,j+1,k) ) +
                    beta2*( f(i,j,k-1) + f(i,j,k+1) )) /
                    ( 2.0*(beta1 + beta2 + 1.0) );
        }
    }

    // update the "black spaces" in the checkerboard domain in parallel
    forall (i, j, k) in dpInner {
        if ((i+j)%2 != 0 && k%2 == 0) || ((i+j)%2 == 0 && k%2 != 0) {
            f_next(i,j,k) = (1.0 - w) * f(i, j, k) + w *
                    (-dx2*tf(i,j,k) + f_next(i-1,j,k) + f_next(i+1,j,k) +
                    beta1*( f_next(i,j-1,k) + f_next(i,j+1,k) ) +
                    beta2*( f_next(i,j,k-1) + f_next(i,j,k+1) )) /
                    ( 2.0*(beta1 + beta2 + 1.0) );
        }
    }
}

proc calc_erro(f, tf, dx2: real, dy2: real, dz2: real): real {
    // do a max reduction during the error calculation,
    //      so that results don't have to be stored in 'residuo'
    var max_error: real;
    forall (i, j, k) in dpInner with (max reduce max_error) {
        max_error reduce= abs(tf(i,j,k) -
            (f(i-1,j,k) - 2.0*f(i,j,k) + f(i+1,j,k))/dx2 -
            (f(i,j-1,k) - 2.0*f(i,j,k) + f(i,j+1,k))/dy2 -
            (f(i,j,k-1) - 2.0*f(i,j,k) + f(i,j,k+1))/dz2);
    }
    return max_error;
}

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

proc escreve(f, imax: int, jmax: int, kmax: int, dx: real, dy: real, dz: real){
    use IO;
	var erro: real;
    const saida = openwriter("saida_chapel_par.csv");

    saida.writef("  %s  %s   %s  %s  %s  %s  %s  %s %s \n",  "x", ",", "y", ",", "z", ",", "f", ",", "e");
    for (i, j, k) in domPoisson {
        erro = abs( f(i, j, k) - (sin(x_coords[i]) * cos(y_coords[j]) * sin(z_coords[k])) );
        saida.writef("%10.12dr  %s %10.12dr   %s  %10.12dr  %s  %10.12dr  %s  %10.12dr  \n",
        		            x_coords[i], ",", y_coords[j], ",", z_coords[k], ",", f(i,j,k), ",", erro);
    }

    saida.close();
}

Can you please point me at the documentation that tells me how to reproduce your use of reduce? Or explain what you are doing? That looks pretty cool and I want to learn such advanced usage.

@damianmoz - these are called "reduce intents". The use of reduce= is optional and can be replaced by the desired operation, like max() in this case.

Please look for 'reduce=' on these pages:
https://chapel-lang.org/docs/main/primers/forallLoops.html
https://chapel-lang.org/docs/main/language/spec/data-parallelism.html#forall-intents

Thanks for that reference, Vass. Not sure I understand it. I might summarize all the examples I can see in the distribution and see if the explanation jumps out at me.

@damianmoz In case this helps, here is an example that computes four reductions over an array all at once, with the ability to add more per-element computations:

var xSum = 0,
    xProd = 1,
    xMax = min(int),
    xMin = max(int);

forall a in MyArray with (+ reduce xSum, * reduce xProd,
                          max reduce xMax, min reduce xMin)
{
  // whatever other computations you want
  xSum  reduce= a;
  xProd reduce= a;
  xMax  reduce= a;
  xMin  reduce= a;
}

writeln((xSum, xProd, xMax, xMin));

In experimenting with my enhanced knowledge of reduce=, courtesy of @vass and @jcorrado, while trying to write some extended documentation on the use of reduce=, I took the Jeremiah (@jcorrado) approach that only creates the elements of the matrix to be reduced (and avoids creating the matrix itself). With a minor improvement to the algebra, and avoiding reduce=, I get

	proc calc_erro(f, tf,  dx2: real, dy2: real, dz2: real)
	{
		param one = 1.0;
		const rdx2 = one / dx2;
		const rdy2 = one / dy2;
		const rdz2 = one / dz2;
		const rxyz = 2.0 * (rdx2 + rdy2 + rdz2);

		return max reduce
		(
			forall (i,j,k) in dpInner do abs
			(
				((f[i,j,k-1] + f[i,j,k+1]))*rdz2 - tf[i,j,k] +
				((f[i,j-1,k] + f[i,j+1,k]))*rdy2 +
				((f[i-1,j,k] + f[i+1,j,k]))*rdx2 - rxyz * f[i,j,k]
			)
		);
	}

The performance matches the @jcorrado variant.

I avoided the red-black ordering and instead computed each diagonal wave in parallel, combined with the enhanced calc_erro routine, and I get exactly the same results as the serial case. That said, except for the gain from the improvement to calc_erro, there is still a limit to the number of cores beyond which you see no further performance gain.
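For the curious, the diagonal-wave idea looks roughly like this (a sketch only, not the code actually benchmarked above, with w1 and w2 as defined in sor_method): all points on the plane i+j+k == d are independent of one another, so the planes are swept serially while each plane is updated in parallel, and the serial update order is reproduced exactly.

    for d in 6 .. (imax-1)+(jmax-1)+(kmax-1)    // sweep the diagonal planes in order
    {
        forall (i, j) in {2..imax-1, 2..jmax-1}
        {
            const k = d - i - j;

            // points on plane d see new values on plane d-1 and old values on plane d+1,
            // exactly as the serial lexicographic sweep does
            if k >= 2 && k <= kmax-1 then
                f(i,j,k) = w1 * f(i,j,k) + w2 *
                    (-dx2*tf(i,j,k) + f(i-1,j,k) + f(i+1,j,k) +
                    beta1*( f(i,j-1,k) + f(i,j+1,k) ) +
                    beta2*( f(i,j,k-1) + f(i,j,k+1) ));
        }
    }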

With the red-black ordering implemented, while the convergence of the maximum error happens in much the same number of cycles, the individual errors are orders of magnitude worse. That said, the now highly parallelized performance not only rocks but also scales as one would hope. I just wish I fully understood how red-black ordering works as well as it does. But here is not the place to discuss the intricacies of the SOR method.

As I said, the parallelized red-black ordering of the SOR method in Chapel absolutely rocks. And is easy to code in Chapel.

I reworked sor_method to be

    proc sor_method(f, tf,  dx2: real, beta1: real,  beta2: real,  w: real)
    {
        param one = 1.0;
        param half = 0.5;
        const w1 = (one - w);
        const w2 = (half * w) / (one + beta1 + beta2);
        const innerIJK = f.domain.expand(-1);
        const g = f;

        // update the "red spaces" in the checkerboard domain in parallel

        forall (i, j, k) in innerIJK
        {
            if ((i+j)%2 == 0 && k%2 == 0) || ((i+j)%2 != 0 && k%2 != 0)
            {
                f(i,j,k) = w1 * g(i, j, k) + w2 *
                    (-dx2*tf(i,j,k) + g(i-1,j,k) + g(i+1,j,k) +
                    beta1*( g(i,j-1,k) + g(i,j+1,k) ) +
                    beta2*( g(i,j,k-1) + g(i,j,k+1) ));
            }
        }

        // update the "black spaces" in the checkerboard domain in parallel

        forall (i, j, k) in innerIJK
        {
            if ((i+j)%2 != 0 && k%2 == 0) || ((i+j)%2 == 0 && k%2 != 0)
            {
                f(i,j,k) = w1 * g(i, j, k) + w2 *
                    (-dx2*tf(i,j,k) + f(i-1,j,k) + f(i+1,j,k) +
                    beta1*( f(i,j-1,k) + f(i,j+1,k) ) +
                    beta2*( f(i,j,k-1) + f(i,j,k+1) ));
            }
        }
    }

This avoids the need for f_next in the main program, which also means that the swap

    // swap 'f_next' into 'f' for the next iteration
    f <=> f_next;

is hence redundant. This interface demands that it be called within the main iterative loop as just

    sor_method(f, tf, dx2, beta1, beta2, w);

I also tidied up the algebra to avoid the division inside the loop, which saves a bit - about 3% on my system.