Question about `+ reduce` applied to a 2D array

Hello:

The program below always gives a0sum == a1sum (both are more or less straightforward ways of summing all elements of the 2D array ad2phi); however, a2sum and a3sum (both computed with + reduce) give values different from a0sum. Moreover, sometimes a2sum == a3sum and sometimes not: running the program several times gives different results. I wonder whether + reduce is not meant for multidimensional arrays, or whether I am using the syntax incorrectly. (Done with Chapel 2.4.0.)

Thanks

Nelson

use Math only sin, cos, pi;
config const N = 16 ;
const del = 2*pi/(N+1);
var Di = {1..N,1..N};
writeln("Di = ",Di);
var ad2phi: [Di] real ;      
// -----------------------------------------------------------------------------
// the analytical values
// -----------------------------------------------------------------------------
var a0sum = 0.0;
for (i,j) in Di do {
   var x = -pi + i*del;
   var y = -pi + j*del;
   ad2phi[i,j] = anadel2(x,y);
   a0sum += ad2phi[i,j] ;
}
var a1sum = asum(ad2phi);
var a2sum;
a2sum = (+ reduce ad2phi);
var a3sum = (+ reduce ad2phi);
writeln("a0sum = ",a0sum);
writeln("a1sum = ",a1sum);
writeln("a2sum = ",a2sum);
writeln("a3sum = ",a3sum);
proc anadel2(
   const in x: real,
   const in y: real
   ) : real {
   return (-y**4 - x**2 * y**2 + 2)*sin(x*y) + 4*x*y*cos(x*y) + 6*x;
}
// -------------------------------------------------------------------
// --> asum: sum all elements of the array x
// -------------------------------------------------------------------
proc asum(x: [] real): real {
   var sum: real = 0.0;
   for a in x do {
      sum += a;
   }
   return sum;
}

Hi Nelson —

Could the differences you're seeing be due to floating point rounding issues caused by the order in which the elements are combined? In particular, a0sum and a1sum are computed sequentially, so they will always combine all elements in the same order. But a2sum and a3sum are each computed in parallel, so elements will be combined in a different order than in the sequential implementations, and potentially in a different order from run to run, depending on how things are scheduled.
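As a tiny illustration of what I mean (this is separate from your program), the grouping alone can change the result:

// Minimal demonstration (not from the original program): floating point
// addition is not associative, so the grouping chosen by a parallel
// reduction can change the final value.
const a = 1.0e16, b = -1.0e16, c = 1.0;
writeln((a + b) + c);   // prints 1.0
writeln(a + (b + c));   // prints 0.0, because b + c rounds back to -1.0e16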

Also, are you seeing differences with the default problem size of N=16, or only at larger runs?

-Brad

Hi Brad: the order of summation is a good explanation. I may have inadvertently created a particularly ill-behaved sum... The behavior is the same for several tested Ns: 16, 32, 64, 128, 256, 512.

Thanks

Nelson

Hi Nelson —

Adding some debug printing and doing some testing myself, I'm now fairly certain that floating point differences due to the non-associativity of floating point addition are what's going on here. Specifically:

  • the results are fairly small, and close to 0.0, sometimes coming out to 0.0 in the parallel versions
  • running with --dataParTasksPerLocale=1 (as a way to squash dynamic parallelism) makes all four values match.

If you needed the reproducibility, but didn't want to write the loops out longhand, you could write the expressions as a reduction over a sequential loop:

a2sum = (+ reduce for a in ad2phi do a);  // or possibly `foreach` rather than `for`

but of course the tradeoff is that you would lose any performance benefits from using multiple tasks to compute the reduction.

-Brad

Thanks Brad: this behavior appeared as part of a larger set of ongoing trials to parallelize the calculation of a Laplacian... To keep things simple and reproducible I will in this case force a sequential sum (I remember seeing algorithms that avoid the loss of accuracy). The non-associativity of floating point addition is an interesting lesson.

Regards

Nelson


Quick update:

both the integral of (-y^4 - x^2*y^2 + 2)*sin(x*y) + 4*x*y*cos(x*y) + 6*x over the square -pi <= x <= +pi, -pi <= y <= +pi, and a Kahan summation of the array give zero: all of the sums are suffering from a small loss of accuracy.
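The Kahan summation I mean is along these lines (a minimal sketch, not necessarily the exact code I ran):

// Sketch of a sequential Kahan (compensated) summation over a real array.
// The running compensation c captures the low-order bits lost by each add.
proc kahanSum(x: [] real): real {
   var s = 0.0, c = 0.0;
   for a in x {
      const y = a - c;    // apply the correction from the previous step
      const t = s + y;    // low-order bits of y may be lost here...
      c = (t - s) - y;    // ...so recover them into the compensation
      s = t;
   }
   return s;
}

Thanks again!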

Hi Nelson,

Glad that Brad highlighted accumulated rounding error as your problem. My part of the world was asleep when you raised and solved the issue. For data with wildly varying magnitudes, even Kahan summation is not always as accurate as you would like; you might need a second-order scheme in such scenarios. That is probably not the case for the Laplacian operator unless you have a singularity or pole in your problem. You can also parallelize these compensated summations if performance is a problem, as sketched below.
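A rough sketch of what I mean (per-task compensated partial sums, here one per row, combined with one more compensated pass; untested and untuned):

// Rough sketch: compute a Kahan partial sum per row in parallel, then
// combine the per-row partials with one more compensated pass.  Assumes
// a 2D array like ad2phi; the row decomposition is arbitrary and any
// per-task chunking would do.
proc parKahanSum(x: [?D] real): real {
   var rowSum: [D.dim(0)] real;
   forall i in D.dim(0) {
      var s = 0.0, c = 0.0;
      for j in D.dim(1) {
         const y = x[i, j] - c;
         const t = s + y;
         c = (t - s) - y;
         s = t;
      }
      rowSum[i] = s;
   }
   // final compensated combination of the per-row partial sums
   var s = 0.0, c = 0.0;
   for r in rowSum {
      const y = r - c;
      const t = s + y;
      c = (t - s) - y;
      s = t;
   }
   return s;
}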

Your expression

(-y**4 - x**2 * y**2 + 2)

scares me. I would write that as

(2 - y**2*(y**2 + x**2))

What specific Laplacian-style PDE are you looking at?

Note the problem gets worse if you are looking at what is effectively an eigen-problem with, say, the two largest eigenvalues close together in a time-varying problem. You might get two different, but consistently recurring, answers for the same parallel computation, even if you run it 20 times: always, the same two answers appear. They probably correspond to a bifurcation point, so both answers are technically correct.

Even different optimization levels can produce different results for the same expression if it is sufficiently problematic. Time to go back to your old numerical analysis textbooks. Mind you, not many of these consider the issues with computations done in parallel. Until a language with parallel features like Chapel is widely used in numerical analysis courses, I am not sure you will see such issues discussed in textbooks or even class notes.

My 2c - Damian

Hi Damian. Thanks for the comments. I am interested both in second-order compensation schemes and in their parallelization, so references would be welcome. Right now I am just looking at the numerical calculation of the Laplacian of

x**3 + y**2*sin(x*y)

to find out how well a straightforward parallelization with forall performs for various NxN grid sizes (after N=1024 it starts to degrade). I fully agree that more textbook coverage of parallel numerical methods is needed, and it should emerge along with good parallel languages such as Chapel.
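In rough terms, the kernel I am timing looks like the following (a sketch, not my exact code; the grid setup is illustrative):

// Sketch (illustrative, not the exact program): fill phi(x,y) = x**3 +
// y**2*sin(x*y) on an (N+2) x (N+2) grid and evaluate a 5-point
// finite-difference Laplacian over the N x N interior with forall loops.
use Math only sin, pi;

config const N = 256;
const h = 2*pi/(N+1);
const Dom = {0..N+1, 0..N+1};   // grid including boundary points
const Inner = Dom.expand(-1);   // interior points {1..N, 1..N}

var phi: [Dom] real;
forall (i,j) in Dom {
   const x = -pi + i*h, y = -pi + j*h;
   phi[i,j] = x**3 + y**2*sin(x*y);
}

var lap: [Inner] real;
forall (i,j) in Inner do
   lap[i,j] = (phi[i+1,j] + phi[i-1,j] + phi[i,j+1] + phi[i,j-1]
               - 4*phi[i,j]) / h**2;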

Cheers

Nelson

I guess one reason why the various sums become so different is that the original function, which I will denote f(x,y) = (-y^4 - x^2*y^2 + 2)*sin(x*y) + 4*x*y*cos(x*y) + 6*x, has terms which are odd with respect to x or y, so the exact sum becomes zero when the domain is symmetric in both x and y (a plot of f(x,y) from WolframAlpha, not reproduced here, illustrates this).

For comparison, these are the results I get from the original version (here I used writef("a0sum = %22.15dr\n", a0sum) for the format):

% chpl -o calc0.x test0.chpl    # original version

% ./calc0.x --N 16
Di = {1..16, 1..16}
a0sum =      0.000000000000654
a1sum =      0.000000000000654
a2sum =      0.000000000000227
a3sum =      0.000000000000227

% ./calc0.x --N 64
Di = {1..64, 1..64}
a0sum =     -0.000000000034589
a1sum =     -0.000000000034589
a2sum =      0.000000000011823
a3sum =      0.000000000011823

% ./calc0.x --N 256
Di = {1..256, 1..256}
a0sum =      0.000000000076703
a1sum =      0.000000000076703
a2sum =      0.000000000349246
a3sum =      0.000000000349246

% ./calc0.x --N 1024
Di = {1..1024, 1..1024}
a0sum =     -0.000000006254048
a1sum =     -0.000000006254048
a2sum =     -0.000000130385160
a3sum =     -0.000000130152330

Because the above result is a direct sum of the Laplacian values, I also tried multiplying each element by the cell volume (del**2) to make it an integral:

   // ad2phi[i,j] = anadel2(x,y);   // original version
   ad2phi[i,j] = anadel2(x,y) * del**2;  // with cell volume

which gives:

% chpl -o calc0vol.x test0.chpl   # with cell volume

% ./calc0vol.x --N 16
Di = {1..16, 1..16}
a0sum =     -0.000000000000055
a1sum =     -0.000000000000055
a2sum =      0.000000000000043
a3sum =      0.000000000000043

% ./calc0vol.x --N 64
Di = {1..64, 1..64}
a0sum =      0.000000000000007
a1sum =      0.000000000000007
a2sum =     -0.000000000000057
a3sum =     -0.000000000000057

% ./calc0vol.x --N 256
Di = {1..256, 1..256}
a0sum =     -0.000000000000051
a1sum =     -0.000000000000051
a2sum =     -0.000000000000057
a3sum =     -0.000000000000057

% ./calc0vol.x --N 1024
Di = {1..1024, 1..1024}
a0sum =     -0.000000000000622
a1sum =     -0.000000000000622
a2sum =      0.000000000000796
a3sum =      0.000000000000796

Also, to make the exact sum nonzero, I tried adding an even function to f(x,y) (here just a Gaussian centered at the origin), which gives

% chpl -o calc1.x test1.chpl   # with a Gaussian function added (the code is attached below)

% ./calc1.x --N 16
N = 16
Di = {1..16, 1..16}
a0sum =     -0.361299100625812
a1sum =     -0.361299100625812
a2sum =     -0.361299100625672
a3sum =     -0.361299100625658

% ./calc1.x --N 64
N = 64
Di = {1..64, 1..64}
a0sum =     -0.258152565926877
a1sum =     -0.258152565926877
a2sum =     -0.258152565926792
a3sum =     -0.258152565926792

% ./calc1.x --N 256
N = 256
Di = {1..256, 1..256}
a0sum =     -0.234023971743991
a1sum =     -0.234023971743991
a2sum =     -0.234023971743881
a3sum =     -0.234023971743888

% ./calc1.x --N 1024
N = 1024
Di = {1..1024, 1..1024}
a0sum =     -0.228115899032804
a1sum =     -0.228115899032804
a2sum =     -0.228115899035892
a3sum =     -0.228115899035892

and so, yeah, I think the apparently large difference between the various sums is due to cancellation among the oscillatory positive and negative regions, which leaves a result that is essentially zero...

(the code with an additional Gaussian here)
test1.chpl (1.7 KB)
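For anyone who can't open the attachment, the modification is essentially of the following form (a sketch only; the exact width and scaling of the Gaussian in test1.chpl may differ):

   // ad2phi[i,j] = anadel2(x,y) * del**2;                          // cell-volume version
   ad2phi[i,j] = (anadel2(x,y) + exp(-(x**2 + y**2))) * del**2;     // plus an even Gaussian (width arbitrary in this sketch; exp comes from the Math module)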
