Block/stencil dists when simulating multi-locale on a single-node machine

Hello! This is a bit long, but I need to provide the full background to my question. I am trying to simulate a multi-locale computer on my desktop, which has only one CPU. To do this, I first did

source util/setchplenv.bash
==> source mlchplenv.sh
make

where mlchplenv.sh is

#!/bin/bash
# --------------------------------------------------------------------
# Simulation of Multi-locale Chapel
# --------------------------------------------------------------------
export CHPL_HOME=~/Dropbox/software/mlchapel-1.24.1
CHPL_BIN_SUBDIR=`"$CHPL_HOME"/util/chplenv/chpl_bin_subdir.py`
# --------------------------------------------------------------------
# reset path
# --------------------------------------------------------------------
export PATH="$PATH":"$CHPL_HOME/bin/$CHPL_BIN_SUBDIR"
export MANPATH="$MANPATH":"$CHPL_HOME"/man
# --------------------------------------------------------------------
# for the time being, I am putting all modules here
# --------------------------------------------------------------------
export CHPL_MODULE_PATH=/home/nldias/Dropbox/chapel/modules
# --------------------------------------------------------------------
# use all cores!
# --------------------------------------------------------------------
export CHPL_RT_NUM_THREADS_PER_LOCALE=MAX_LOGICAL
# --------------------------------------------------------------------
# This seems enough to generate a "multi-locale" compiler
# --------------------------------------------------------------------
export CHPL_COMM=gasnet
export CHPL_TARGET_CPU=none
export CHPL_LAUNCHER=smp
export CHPL_COMM_SUBSTRATE=smp
export GASNET_ROUTE_OUTPUT=0

This built the Chapel compiler without errors, and the simple example programs from Chapel's home page ran correctly.

Then I went on to implement a solution of d^2u/dx^2 = 0, u(0) = 0, u(1) = 1 with a relaxation method. I did it serially, then with a Block distribution (using -nl 4), then with a Stencil distribution (using -nl 4). The first two ran OK and produced the same output for 16 internal points plus two boundary points, namely

0.0 0.0588235 0.117647 0.17647 0.235294 0.294117 0.352941 0.411764 0.470588 0.529411 0.588235 0.647058 0.705882 0.764706 0.823529 0.882353 0.941176 1.0

The Stencil version failed miserably, however, giving

0.0 0.1 0.2 0.3 0.4 0.5 0.541667 0.583333 0.625 0.666667 0.708333 0.75 0.791667 0.833333 0.875 0.916667 0.958333 1.0

Here is the Stencil version; the serial version and the Block-dist version are pretty much the same code, with the obvious changes/omissions:

use StencilDist;
const Di = {1..16};      // the internal subdomain
const D = {0..17}        // the total domain and its bounding box
   dmapped Stencil(boundingBox=Di,fluff=(1,));
var u: [D] real;         // the solution
var deltau: [D] real;    // the residues
const epsilon = 1.0e-10;  // the overall accuracy
var deltam = epsilon;    // the "norm" over the residues
u[0] = 0.0;              // a ghost point (BC at 0)
u[17] = 1.0;             // a ghost point (BC at 1)
u[Di] = 0.5;             // the initial guess
var nc = 0;              // iteration counter
for loc in Locales do {
   on loc {
      writeln(u.localSubdomain());      // subdomains, per locale
   }
}
while deltam >= epsilon do {       // check convergence
   deltam = 0.0;
   forall i in 1..15 by 2 do {
      var uavg = (u[i+1]+u[i-1])/2.0;
      deltau[i] = (uavg - u[i]); 
      u[i] += deltau[i];
      deltau[i] = abs(deltau[i]);
   }
   u.updateFluff();
   forall i in 2..16 by 2 do {
      var uavg = (u[i+1]+u[i-1])/2.0;
      deltau[i] = (uavg - u[i]); 
      u[i] += deltau[i];
      deltau[i] = abs(deltau[i]);
   }
   u.updateFluff();
   deltam = (+ reduce deltau)/16;
   nc += 1;
}
writeln(nc);
writeln(u);

Any light on what I am doing wrong, or why Stencil does not work in this simulated multi-locale environment, will be greatly appreciated.

Cheers

Nelson

Hi Nelson —

I'm able to reproduce the difference you're seeing, and believe that the issue is quite possibly a bug on our part, though I haven't been able to determine what's causing it yet. I'm timing out for the moment, but will hope to get back to this later tonight or tomorrow.

-Brad

Hi Nelson —

Aha, got it. Happily (for me, as a Chapel developer), this was a user error—or, actually, two user errors. And happily (for you), it was sufficiently non-obvious that you shouldn't feel bad about not seeing it given how long it took me to sort it out (where part of my problem was not having sufficient faith in the Stencil distribution, so focusing too much on assuming it was buggy). I'll give some suggestions below which, in retrospect, would have been helpful; but I didn't think to use them, so don't interpret it as lecturing you so much as giving you some tools that may help in the future.

The first bug jumped out at me quickly; fixing it alone didn't resolve the problem, but it is important for correctness and generality, and it also seems to affect the number of iterations required to converge. It is that the program currently begins by reading and writing u using a stencil pattern and only then calls updateFluff(). What I think you should be doing instead is calling updateFluff() just before each of your forall loops rather than after, to ensure that the fluff values (halo values, ghost cells, overlap regions, whatever you want to call them) are up to date. Thus, where you currently have:

   forall i in ... do {
      ...
   }
   u.updateFluff();

you want:

   u.updateFluff();
   forall i in ... do {
      ...
   }

The second problem is the big one, though: it relates to how your forall loops are being implemented, which would generally cause scalability problems for the Block distribution, and which also causes correctness issues for what you're trying to do with the Stencil distribution.

Essentially, given a forall loop like this in Chapel:

   forall i in 1..15 by 2 do { ... }

the iterations of the loop will all execute on the locale where the task encountering the loop is running—in this case locale 0 since that's where the program started running. The issue is that ranges are, by nature, local types in Chapel, and so forall loops over ranges use local resources by default. You can verify this by putting a writeln(here.locale, " owns ", i); statement within your loops' bodies to see that each iteration executes on locale 0.
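
For instance, a quick diagnostic sketch along those lines (assuming the loop bounds from the original program): every iteration reports locale 0, regardless of how many locales the program was launched with.

forall i in 1..15 by 2 do
   writeln(here.id, " owns iteration ", i);   // prints 0 for every i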

Thus, for a loop like the above, typically #cores (numPUs()) tasks will be created on locale 0, and they will divide the range's indices between themselves. With the Block-distributed version, this results in correct code because Chapel's global namespace permits the tasks to access the array elements whether they are local or remote. But it won't be scalable code because, as you add locales, you won't be adding more parallelism, and will only be increasing the amount of communication as a larger fraction of the array elements are remote with respect to locale 0.

Now, the reason that this turns into a correctness problem with the Stencil distribution is that the behavior of a stencil-distributed array is to access the element locally if it's available; and remotely only if it's not. And where this gets tricky is that elements on the boundaries live on two locales—once as the primary copy and once as the cached/fluff/halo value. So, for example, if I'm running on 2 locales, locale 0 owns the primary copies of {0..8} but a cached value for {9}. This means that a task running on locale 0 accessing element 9 will access its local cached copy; but that a task running on locale 1 accessing element 9 will access the real copy. And this is what prevents the Stencil distribution from doing a bunch of fine-grain communication for +/-1 accesses with a fluff value of 1. But in your case, since all your tasks were running on locale 0, they were updating the local (cached) copy of element 9, and then when calling updateFluff(), that value was being overwritten by the "real" element 9, owned by locale 1.

Something that could have helped with this—and that we've wrestled with a bit over time, but not acted on—would be to only permit local reads to fluff and to disallow local writes (or perhaps to require a user to opt into writing to fluff somehow). If we'd done that, you'd either have gotten some sort of locality violation warning ("Hey, you can't write to element 9 from locale 0!") or perhaps you'd have just gotten a remote write to the real element 9 and just been back to the scalability issues of the Block distribution.

In retrospect, one way you might have detected what was going wrong would've been to rewrite your array accesses from u[i] to u.localAccess[i] as a means of asserting that you expect all array accesses to be local, and of having Chapel complain if they weren't. I don't believe this would have prevented the write to the fluff in any way, but it might have alerted you to the fact that your tasks weren't running where you expected them to and led to the insight I eventually reached.
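
For example, here is a sketch of that idea applied to your first loop, leaving everything else unchanged; the expectation, per the description above, is that Chapel would complain at execution time about the accesses to remotely-owned elements instead of silently communicating:

forall i in 1..15 by 2 do {
   var uavg = (u.localAccess[i+1] + u.localAccess[i-1])/2.0;
   deltau.localAccess[i] = (uavg - u.localAccess[i]);
   u.localAccess[i] += deltau.localAccess[i];
   deltau.localAccess[i] = abs(deltau.localAccess[i]);
}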

Anyway, cutting to the chase (belatedly), what's the fix? Well, you really want those loops to be iterating over a distributed domain that is aligned with the D domain. And the easiest way to do this is to take a slice of the D domain with the set of indices you want to iterate over. Here's how I expressed it in my fixed version of the code:

const D = {0..17}        // the total domain and its bounding box
   dmapped Stencil(boundingBox=Di,fluff=(1,));

const Dodd  = D[1..15 by 2],  // could also call these things like `Red` and `Black` or ...
      Deven = D[2..16 by 2];

In a domain slice, or intersection, like this, the distribution of the original domain is preserved, so Dodd and Deven will describe the specified subset of D's indices, yet aligned with D. So then if you write your forall loops like:

forall i in Dodd ...

forall i in Deven ...

you should see the writeln()s I suggested inserting (temporarily) print the proper ownership, get better scalability with the Block distribution, and get both correctness and better scalability with the Stencil distribution.
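
Concretely, re-running the earlier diagnostic over the distributed slices (a sketch, assuming the Dodd/Deven declarations above) should now report a mix of locale ids rather than just 0:

forall i in Dodd do
   writeln(here.id, " owns iteration ", i);   // ownership now follows the distribution

forall i in Deven do
   writeln(here.id, " owns iteration ", i);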

Hope this makes sense and helps, but ask more questions if not,
-Brad

PS — And feel free to file a GitHub issue against us if you like the idea of having the Stencil distribution disallow writes to its fluff by default. I believe that would be fairly easy for us to implement and would've saved a lot of trouble here, in retrospect.

Hi Brad: many thanks for the very clear and detailed answer! It was really helpful. I have implemented your suggested changes, and now everything is fine. I am too green with distributions to have a say on how Stencil should act on local writes, but I will keep this in mind as I try to implement more substantial cases.

Regards,

Nelson


Sounds good, thanks Nelson. This morning, I decided to go file the issue myself because I think this is pretty confusing behavior, and stumbled back across this one, which I had a vague memory of last night:

So my current intention is to re-read this thread, see how our experience this week affects my opinions on it, and add some notes there, looking for actionable items.

-Brad

(Oh, I also wanted to note that my eventual goal for Chapel would be that a user wouldn't need to explicitly use a stencil distribution or updateFluff(), but that the compiler and Block distribution would be capable of doing ghost cell management automatically as we did in the ZPL language... but we're not quite there yet, unfortunately).

-Brad

I went to the GitHub issue about confusing Stencil behavior and modified the first example a little bit:

use StencilDist;
var space = {0..7};
var dom = space dmapped Stencil(space, fluff=(1,), periodic=true);
var arr: [dom] int;
for i in dom do {
   arr[i] = i;
}
writeln(arr);
for loc in Locales do {
   on loc {
      writeln(arr.localSubdomain());      // subdomains, per locale
   }
}
arr.updateFluff();
writeln(arr[-1]);     //read from the fluff; prints the value in arr[7]
on Locales[1] do {
   arr[-1] = 14;         // write to the fluff; doesn't modify arr[7]
   writeln("-1 mod 8 = ", mod(-1,8));
   writeln("in the fluff, arr[-1] = ", arr[-1]);
   arr[4] = 14;          // write to the fluff; modifies arr[4] correctly
}
writeln(arr);
arr.updateFluff();
on Locales[1] do {
   arr[mod(-1,8)] = 15;       // write to the fluff; modifies arr[7] correctly
}
arr.updateFluff();
writeln(arr);

Indeed, this is strange, and the problem seems to be that the Stencil module and the compiler accept -1 as a valid index, apparently because of the periodic=true argument: modifying arr[7] directly (using mod, just to be fancy) works, as arr[4] already did.

Hi Nelson —

Sorry for the delayed response here (we had a long weekend this week).

Spending a bit more time with the issue that I referred to above and the documentation of the Stencil distribution, I think that I actually should've considered this to be a bug in our code. Specifically, I'd forgotten about this line in the documentation:

Any write to array data will be applied to the actual element, the same as if you were using a Block-distributed array.

which suggests that the write to the fluff element should not have been permitted. Thus, when I said:

the behavior of a stencil-distributed array is to access the element locally if it's available; and remotely only if it's not.

it seems I was misremembering... This should only be the case for reads, not for writes. There is that funny little caveat about global boundary elements (which is what the lion's share of that issue focused on), but since in your case we're talking about an interior boundary, I'm ignoring that aspect for now.

My next step is to look at the code and see why it's not behaving as advertised.

In any case, I think the proposed rewrite is still the best approach in terms of scalability and capturing what I think your true intention is, but it would've been better if it had only resulted in performance surprises and not subtle correctness surprises as well.

-Brad

Hi Nelson —

I now have additional insights and corrections: Stencil is working as advertised, and I think your original program's behavior can be called a user issue, but it's subtle enough that I want to work harder to see what we can do to prevent it.

Specifically, based on my original diagnosis and previous comment, I was expecting the following loop in the context of your original program to produce different output on 1 vs. 2 locales:

forall i in 1..15 by 2 do {
  u[i] = 0.125;
}

However, this program behaved as expected.

So, digging deeper, here's what was going wrong with your original program. Given the simplified loop:

forall i in 1..15 by 2 do {
  deltau[i] = 0.125;
  u[i] = deltau[i];
}

the problem is that deltau is distributed by the same Stencil distribution as u, even though it isn't ever used in a stencil pattern. The implication is that the write to deltau[i] updates the real array element, but the read of deltau[i] on the next line may read a fluff value. Since deltau.updateFluff() was never called, that read returns a stale (or uninitialized) value whenever deltau[i] is owned by a remote locale but a cached copy of it exists locally. This is obviously confusing, since a simple reading of the program suggests that the value just written is the one being read back.

This can be verified by splitting the loop into two loops and calling updateFluff() on deltau:

forall i in 1..15 by 2 do {
  deltau[i] = 0.125;
}
deltau.updateFluff();
forall i in 1..15 by 2 do {
  u[i] = deltau[i];
}

Or, if we were rewriting the original loop as:

forall i in 1..15 by 2 do {
  var du = 0.125;
  deltau[i] = du;
  u[i] = du;
}

things would also work, since we're not relying on reading deltau anymore.

The reason my proposed rewrite fixed this is that, by having the owning locale do the reads and writes to u and deltau, we avoid reaching across to a remote value that could've been computed locally. Another fix would've been to declare deltau using a simple Block distribution with no caching.

Though I call this a user issue, it's sufficiently confusing that I think it needs to be addressed (i.e., it shouldn't take a user coming to us to understand it, and it shouldn't take me three attempts to explain what's happening correctly). I'm planning to look into the "write-through" idea that @e-kayrakli proposed on the related issue we've been discussing, to see whether it can be put together in short order; I believe it would address this, since the write to deltau[i] would update both the fluff value and the original value. It seems to me that it should be tractable, if not easy, but I could definitely be missing something.

Best wishes,
-Brad

Shoot, I was missing something. I was abstractly thinking of writing to an array element as "we'll just write to two array elements, easy!" But I was forgetting that there is no "write to array element" interface; rather, it's a "return a reference to an array element that can be written" interface. So in this case we'd effectively have to return two references to array elements, which doesn't work... So I think we'll have to live with the current behavior for a while longer, until we can come up with a way to implement a write-through pattern.
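
To illustrate the shape of the interface with a toy, self-contained example (this is not Chapel's actual array machinery, just a sketch): indexing hands back a ref, the caller writes through that reference, and the container never sees the assignment itself, so there's no obvious hook for mirroring the write into a second (fluff) copy.

class Toy {
   var data: [0..3] int;

   proc this(i: int) ref {
      return data[i];    // hand back a reference; the caller writes through it
   }
}

var t = new Toy();
t[2] = 42;               // Toy only ever sees "give me a reference to element 2"
writeln(t.data);         // prints: 0 0 42 0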

That said, this pattern is very similar to another I've been wrestling with lately, which is encouraging. I've filed "Rethink 'write to array element' interface?" (chapel-lang/chapel#17999) to capture the issue.

-Brad

Hi Brad: sorry for the very late response; this has been a very busy week, unfortunately away from Chapel. I learned quite a bit about the Stencil distribution from all your comments. Below is a "final" version of the Laplace 1-D solution. The upshot is that yes, updateFluff before the forall's makes a lot of difference: convergence is achieved in 410 iterations; if I put the updateFluff after each forall, then it takes 507.

In the program I am not using deltau.updateFluff: it slows down the program, and because deltau is strictly a distributed array, it does not seem to make any difference either on convergence or on the correctness of the final values. I guess that the (+ reduce deltau) statement takes care of polling the distributed values of it across 4 locales (running with -nl 4) and doing the right thing. Many thanks: I am quite happy to live with StencilDist as it is, now that I know it a little better.

use StencilDist;
const Di = {1..16};      // the internal subdomain
const D = {0..17}        // the total domain and its bounding box
   dmapped Stencil(boundingBox=Di,fluff=(1,));
const Dodd = D[1..15 by 2];
const Deven = D[2..16 by 2];
var u: [D] real;         // the solution
var deltau: [D] real;    // the residues
const epsilon = 1.0e-10; // the overall accuracy
var deltam = epsilon;    // the "norm" over the residues
u[0] = 0.0;              // a ghost point (BC at 0)
u[17] = 1.0;             // a ghost point (BC at 1)
u[Di] = 0.5;             // the initial guess
var nc = 0;              // iteration counter
for loc in Locales do {
   on loc {
      writeln(u.localSubdomain()); // subdomains, per locale
   }
}
while deltam >= epsilon do {       // check convergence
   deltam = 0.0;
   u.updateFluff();
   forall i in Dodd do {
      var uavg = (u[i+1]+u[i-1])/2.0;
      var du  = (uavg - u[i]); 
      u[i] += du;
      deltau[i] = abs(du);
   }
   u.updateFluff();
   forall i in Deven do {
      var uavg = (u[i+1]+u[i-1])/2.0;
      var du  = (uavg - u[i]); 
      u[i] += du;
      deltau[i] = abs(du);
   }
   deltam = (+ reduce deltau)/16;
   writeln(deltam);
   nc += 1;
}
writeln(nc);
writeln(u);

Hi Nelson —

Thanks for the response and no worries about the delay. It was probably a good thing to help me get my facts straight anyway. 🙂 I'm glad you feel on the right track, and won't obsess about StencilDist for now, though I do think we should improve it to prevent other users from stepping into this trap in the future.

A few comments in response to your mail, though they're getting increasingly esoteric. However, if you read nothing else, be sure to read from "THE GOOD STUFF" onwards:

The upshot is that yes, updateFluff before the forall's makes a lot of difference: convergence is achieved in 410 iterations; if I put the updateFluff after each forall, then it takes 507.

This sounds similar to what I was seeing, and it surprised me how large an impact it had. If I understood the flow of the code correctly, it seemed as though putting them before vs. after should only change the initial conditions, not the steady state (since, once the loop is running, it's symmetric, I think, right?). Is it clearer to you why the difference was so large? Is it simply that the first loop reads a 0.0 rather than a 0.5 because things aren't updated, and it takes that long to correct for the mistake? Or something else that I'm not understanding?

In the program I am not using deltau.updateFluff

I agree that you should not do this and was only indicating it as the way to make the original program function correctly with no other changes. But the better way to fix the original program is to drive the loops using the distributed Dodd / Deven domains (as you are now) which avoids the "write to deltau's fluff" issue by running all iterations on their logical home locales (while also distributing the work better, which will get you better scalability).

I guess that the (+ reduce deltau) statement takes care of polling the distributed values of it across 4 locales (running with -nl 4) and doing the right thing.

I don't think it's that; it's simply that the deltau elements are now always written/read from the locale owning the element since the forall loops have been properly distributed using Deven/Dodd. As a result, the fluff values of deltau (which is where things were going wrong in the original program) never come into play.

In fact, because its fluff is unused, deltau could be declared as a traditional Block array rather than a Stencil array, and I've been wrestling with whether to propose this or not. On one hand, doing so would make it a simpler array type and reduce the number of unused fluff values allocated (not a big deal for 1D arrays; possibly slightly more impactful in terms of memory overhead and cache usage for a 2D or 3D array). On the other hand, it has the downside of using two distributed array types rather than one, which would likely increase compilation time and could obscure, from the Chapel compiler's perspective, the aligned relationship between the Stencil-distributed domain D and the Block-distributed domain that would be used to declare deltau.
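
For concreteness, here is a sketch of the alternative being weighed (DB is a name introduced just for this sketch): deltau gets its own Block-distributed domain over the same index set, so no fluff storage is allocated for it.

use BlockDist, StencilDist;

const Di = {1..16};
const D  = {0..17} dmapped Stencil(boundingBox=Di, fluff=(1,));
const DB = {0..17} dmapped Block(boundingBox=Di);

var u:      [D]  real;   // needs fluff for the +/-1 stencil reads
var deltau: [DB] real;   // never read in a stencil pattern, so Block suffices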

[I'm tagging my colleague @e-kayrakli here, who has been implementing locality-based optimizations recently to make him aware of this tension. No response required, Engin, just wanted to put it in front of you as food for thought / an interesting case study. And I can summarize in person sometime to save you reading this thread].

THE GOOD STUFF: Ah, but here's the best approach of all: What if we eliminate the deltau array altogether? This reduces our overall memory footprint and working set while computing the reduction as we go rather than as a separate step. The following version does this:

use StencilDist;
const Di = {1..16};      // the internal subdomain
const D = {0..17}        // the total domain and its bounding box
   dmapped Stencil(boundingBox=Di,fluff=(1,));
const Dodd = D[1..15 by 2];
const Deven = D[2..16 by 2];
var u: [D] real;         // the solution
const epsilon = 1.0e-10; // the overall accuracy
var deltam = epsilon;    // the "norm" over the residues
u[0] = 0.0;              // a ghost point (BC at 0)
u[17] = 1.0;             // a ghost point (BC at 1)
u[Di] = 0.5;             // the initial guess
var nc = 0;              // iteration counter
for loc in Locales do {
   on loc {
      writeln(u.localSubdomain()); // subdomains, per locale
   }
}
while deltam >= epsilon do {       // check convergence
   deltam = 0.0;
   u.updateFluff();
   forall i in Dodd with (+ reduce deltam) {
      var uavg = (u[i+1]+u[i-1])/2.0;
      var du  = (uavg - u[i]); 
      u[i] += du;
      deltam += abs(du);
   }
   u.updateFluff();
   forall i in Deven with (+ reduce deltam) {
      var uavg = (u[i+1]+u[i-1])/2.0;
      var du  = (uavg - u[i]); 
      u[i] += du;
      deltam += abs(du);
   }
   deltam /= 16;
   writeln(deltam);
   nc += 1;
}
writeln(nc);
writeln(u);

where the diff from your previous version is:

8d7
< var deltau: [D] real;    // the residues
23c22
<    forall i in Dodd do {
---
>    forall i in Dodd with (+ reduce deltam) {
27c26
<       deltau[i] = abs(du);
---
>       deltam += abs(du);
30c29
<    forall i in Deven do {
---
>    forall i in Deven with (+ reduce deltam) {
34c33
<       deltau[i] = abs(du);
---
>       deltam += abs(du);
36c35
<    deltam = (+ reduce deltau)/16;
---
>    deltam /= 16;

Essentially, what I've done is add a + reduce intent to each of the forall loops for the deltam variable, which causes each of the tasks implementing the loop to store its own local copy of deltam and to combine those copies with the original deltam at the loop's end using the + operation. Then, by changing the write to deltau[i] into a += on deltam, I accumulate into the per-task values as we go.
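
As a minimal standalone illustration of that pattern (A is just a placeholder array for this sketch):

var A: [1..8] real = 0.125;
var total = 0.0;
forall x in A with (+ reduce total) do
   total += x;           // each task accumulates privately; the partial sums
                         // are combined into total when the loop completes
writeln(total);          // prints 1.0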

(The one other change I made is purely cosmetic, but good to know about: Chapel loops support either a single-statement form using the do keyword or a multi-statement form using a block statement ({ ... }). The two can be combined, as you were doing (do { ... }), since the block statement is itself a single statement, but that's not necessary unless it's your stylistic preference.)
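
The two forms side by side, just for reference:

for i in 1..3 do writeln(i);   // single-statement form using `do`

for i in 1..3 {                // block-statement form; no `do` needed
   writeln(i);
}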

I didn't do any detailed timings, but from my perception of the rate of the computation, this change seemed significant, so it's worthwhile as long as it doesn't mess up the numerical properties of your final algorithm in any way (it didn't seem to for this test case, based on the number of iterations).

Best wishes,
-Brad