Hello everyone,
We are working on a finite-volume / stencil-related project and have a function called `roll`, similar to `numpy.roll`. We were wondering whether the following implementation would be safe in a distributed computing environment.

```chapel
proc roll(a: [?D], in shift, axis = 0) {
  var a1: domain(a.rank);
  var a2: domain(a.rank);

  if (a.rank == 1) {
    if (shift < 0) {
      shift *= -1;
      a1 = {D.low+shift..D.high};
      a2 = {D.low..D.low+shift-1};
    } else {
      a1 = {D.high-shift+1..D.high};
      a2 = {D.low..D.high-shift};
    }

    var rez: [D] a.eltType;
    rez[D.low..a1.size] = a[a1];
    rez[a1.size+1..D.high] = a[a2];
    return rez;
  } else if (a.rank == 2) {
    var first_half: domain(2);
    var second_half: domain(2);
    if (axis == 0) {
      if (shift < 0) {
        shift *= -1;
        a1 = {a.dim(0).low+shift..a.dim(0).high, a.dim(1)};
        a2 = {a.dim(0).low..a.dim(0).low+shift-1, a.dim(1)};
      } else {
        a1 = {a.dim(0).high-shift+1..a.dim(0).high, a.dim(1)};
        a2 = {a.dim(0).low..a.dim(0).high-shift, a.dim(1)};
      }
      second_half = {a1.dim(0).size+1..a.dim(0).high, a.dim(1)};
      first_half = {a.dim(0).low..a1.dim(0).size, a.dim(1)};
    } else {
      if (shift < 0) {
        shift *= -1;
        a1 = {a.dim(0), a.dim(1).low+shift..a.dim(1).high};
        a2 = {a.dim(0), a.dim(1).low..a.dim(1).low+shift-1};
      } else {
        a1 = {a.dim(0), a.dim(1).high-shift+1..a.dim(1).high};
        a2 = {a.dim(0), a.dim(1).low..a.dim(1).high-shift};
      }
      second_half = {a.dim(0), a1.dim(1).size+1..a.dim(1).high};
      first_half = {a.dim(0), a.dim(1).low..a1.dim(1).size};
    }

    var rez: [D] a.eltType;
    rez[first_half] = a[a1];
    rez[second_half] = a[a2];
    return rez;
  }
}
```

The function above is just supposed to roll the given array to the left or right along either axis.

Any suggestions for improving the function would also be appreciated. We were also thinking of trying a stencil-based approach, but aren't getting anywhere with it.

-- Dhruv

Hi Dhruv —

From a quick read-through of the code, I believe this will be safe in a distributed environment, given that you're using a distinct array for the input vs. the output. That means there shouldn't be any data races, which would be the main cause of problems.

That said, there's a subtle semantic issue here that may cause performance problems if `a` is a distributed array, and which would be good to be aware of and possibly work around. Specifically, when slicing an array with a domain, as in your `a[a1]` expressions, the slice expression uses the slicing domain as its logical domain. So in a case like this, where `a1` and `a2` are default / local / non-distributed domains (because nothing has said they're distributed), the `a[a1]` expression is logically local. Put another way, if you wrote `var b = a[a1];`, `b` would be a completely local array governed by domain `a1`, similar to `var b: [a1] real;`. You're then assigning that to a distributed slice (because slicing by ranges, as in `rez[D.low..a1.size]`, preserves the distribution of the original array), which may mean that using the local domain is not actually problematic — but I'm not confident, and it makes me a little worried. In particular, in the 2D case, since both the LHS and RHS slices use a local domain, it may be that only one locale's cores would be involved in carrying out the assignment `rez[first_half] = a[a1];`.

More discussion around the design of this semantic property is available here, if interested: https://github.com/chapel-lang/chapel/issues/12936

An easy way to preserve the distribution of the original array in your domains would be to use domain slicing to compute the `a1` and `a2` subdomains, where domain slicing essentially computes the intersection between two domains, or a domain and a range per dimension. For example, writing `var a1 = D[D.low+shift..D.high]` would give you a domain `a1` which preserves/matches `D`'s distribution. Of course, your control flow may make that approach more challenging, so another approach would be to write `var a1: D.type;` to have `a1` share the same type as `D`. Then in assignments like `a1 = {D.low+shift..D.high};`, the LHS would end up being a distributed version of the RHS domain (since domain assignment only copies indices, not distributions).

Here's a toy example that demonstrates these approaches and shows that the results are still block-distributed:

https://tio.run/##VYw9D4IwEIZ3fsWNsFxENxoXcyMTK3FooJbGWpq2Sozht9cqDDC9H/e81w3cCh3j0wu46LG7k/KBZd1obkpCEh/AwBnKA8te3AEl/ykRzQz9g1sr@mWWL2WxUnXCqD0hEg5KDte1biogDG8rfl8I9TghHmeWTU4FoU1OdbEJzTbU/13lg1NG7qj9IcYv
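In case that link becomes stale, here's a minimal sketch of what such an example might look like, assuming the Block distribution from the `BlockDist` module (the specific domain sizes are just for illustration):

```chapel
use BlockDist;

const D = {1..10} dmapped Block(boundingBox={1..10});

// Approach 1: domain slicing -- intersecting D with a range yields a
// subdomain that preserves D's distribution.
var a1 = D[D.low+2..D.high];

// Approach 2: declare the subdomain with D's type, then assign an index
// set; domain assignment copies indices, not distributions, so the LHS
// remains distributed like D.
var a2: D.type;
a2 = {D.low..D.low+1};

// Printing the domains (or iterating over them and printing 'here.id')
// can be used to convince yourself both remain block-distributed.
writeln("a1 = ", a1);
writeln("a2 = ", a2);
```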

The other main thought I had looking at your code is that Chapel has some rank-independent programming features that might make it possible to write this in a way that didn't have to enumerate the per-rank / per-axis options as you currently do. For rectangular domain computations, the most frequent way to do this is to get a tuple of ranges describing the index set, modify those ranges, and then assign them into a domain.

For example, here's a toy code that rewrites a specified dimension of a domain with -100..100 (I'm not providing a TIO link because TIO's version of Chapel is out-of-date and uses 1-based indexing for tuples and domain dimensions, rather than the 0-based indexing of recent releases):

```chapel
proc foo(D, dim: int) {
  var dims = D.dims();
  writeln(dims);
  dims(dim) = -100..100;
  writeln(dims);
  var dom: D.type = dims;
  return dom;
}

writeln(foo({1..10}, 0));
writeln(foo({1..10, 1..10}, 0));
writeln(foo({1..10, 1..10}, 1));
```

Though I haven't tried it, it seems as though it ought to be possible to write a helper routine that creates the two ranges for the shifted dimension, and then to use a similar approach to patch just that single dimension without touching or naming the others, yielding a rank-independent version of the routine.
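For instance, such a rank-independent version might look something like the following untested sketch (the `mod`-based shift normalization and the specific range arithmetic are my own guesses, so double-check the boundary cases before relying on it):

```chapel
proc roll(a: [?D], in shift: int, axis = 0) {
  const r = a.dim(axis);
  shift = mod(shift, r.size);  // normalize negative shifts; assumes r is nonempty

  // The two source ranges along the shifted dimension...
  var srcDims1 = D.dims(),
      srcDims2 = D.dims();
  srcDims1(axis) = r.high-shift+1..r.high;  // the last 'shift' elements
  srcDims2(axis) = r.low..r.high-shift;     // everything else

  // ...and the corresponding destination ranges.
  var dstDims1 = D.dims(),
      dstDims2 = D.dims();
  dstDims1(axis) = r.low..r.low+shift-1;
  dstDims2(axis) = r.low+shift..r.high;

  // Declare the subdomains with D's type so they preserve D's distribution.
  var src1: D.type = srcDims1, src2: D.type = srcDims2,
      dst1: D.type = dstDims1, dst2: D.type = dstDims2;

  var rez: [D] a.eltType;
  rez[dst1] = a[src1];
  rez[dst2] = a[src2];
  return rez;
}
```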

Finally, the million-dollar question for this routine will be how it performs. We've put some effort into making these whole-array assignment operations use coarse-grained communication, but I can never quite remember which cases are currently handled and how well or poorly. In practice, I tend to use the CommDiagnostics module to see what communication a statement like `rez[first_half] = a[a1];` induces. If you find it's terrible in some way, that would be good for us to know and optimize (either within the array assignment code on our side, or we could provide tips for optimizing it in your routine). Relatedly, there's an internal compile-time setting, `-schpl_serializeSlices`, whose effect on this code's communication behavior would be interesting to see.
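For reference, the sort of CommDiagnostics experiment I have in mind looks like this (a sketch; substitute your actual `roll()` assignments for the placeholder one here):

```chapel
use BlockDist, CommDiagnostics;

const D = {1..1000} dmapped Block(boundingBox={1..1000});
var a, rez: [D] real;

startCommDiagnostics();
rez[1..500] = a[501..1000];  // stand-in for 'rez[first_half] = a[a1];'
stopCommDiagnostics();

// Prints one entry per locale with counts of the gets, puts, remote
// executions, etc. performed during the assignment.
writeln(getCommDiagnostics());
```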

Hope this is helpful information and the sort of thing you were looking for. Please feel encouraged to ask follow-up questions.