BlockDist's greedy algorithm for reshaping target locales

BlockDist — Chapel Documentation 1.28 says the following:
The rank of targetLocales must match the rank of the distribution, or be 1. If the rank of targetLocales is 1, a greedy heuristic is used to reshape the array of target locales so that it matches the rank of the distribution and each dimension contains an approximately equal number of indices.

I am interested in what the greedy heuristic algorithm is. Does it try to reshape the targetLocales so that each dimension has almost the same number of locales? Or does it reshape the targetLocales so that each dimension gets approximately the same number of indices from the boundingBox?

Lastly, how is the greedy algorithm implemented? Does it minimize the maximal difference between the per-dimension factors?

Thanks!

Hello @Anjiang,

The heuristic is implemented in DSIUtil.chpl, specifically in the routine:

proc _factor(param rank: int, value)

where value is the number of target locales to be reshaped. The while-loop is the core of the heuristic, setting each component of the tuple factors to the size of the reshaped array in the corresponding dimension.

According to the documentation you cite, the heuristic strives for equal dimensions of the reshaped array, under the constraint that its size must equal the original number of target locales (value). For example, for a 2-dimensional block-distributed domain/array and a 1-dimensional original array of target locales:

4 target locales are reshaped into a 2*2 array
5 --> 5 * 1
6 --> 3 * 2
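
If it helps to see the shape of the idea in code, here is a simplified sketch (not the actual _factor implementation) of one greedy way to split a locale count into rank factors that are as close to equal as possible, with the larger factors tending to come first:

// Simplified sketch only; the real code is _factor in DSIUtil.chpl.
proc greedyFactor(param rank: int, value: int): rank*int {
  var factors: rank*int;
  var remaining = value;
  for i in 0..<rank {
    const dimsLeft = rank - i;
    // ideal "equal share" of what's left for this dimension
    // (the small epsilon guards against floating-point fuzz)
    const target = (remaining:real) ** (1.0 / dimsLeft) - 1e-9;
    var f = remaining;  // fall back to taking everything that's left
    for d in 1..remaining do
      if remaining % d == 0 && d >= target {
        f = d;
        break;
      }
    factors[i] = f;
    remaining /= f;
  }
  return factors;
}

writeln(greedyFactor(2, 4));  // (2, 2)
writeln(greedyFactor(2, 5));  // (5, 1)
writeln(greedyFactor(2, 6));  // (3, 2)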

Vass


Vass beat me to this by a minute, but to complement his answer:

At a high level, I think of our approach as making the locale array as square as possible (for good surface::volume ratios) and erring on the side of more locales in earlier dimensions than later ones (to get larger blocks of consecutive elements, due to our row-major order layout). For example, in the 6-locale case Vass gives, a 100x100 domain targeting a 3x2 locale array results in ~33x50 blocks per locale, giving runs of 50 consecutive elements in the innermost dimension (so, laid out consecutively in memory), whereas a 2x3 layout results in 50x~33 blocks per locale, giving only ~33 adjacent elements per row.

All that said, I don't recall how hard we worked at that, nor how good it is. Part of our reason for being a bit vague about the heuristic is to reserve the right to update it in the future.

If you haven't already encountered this pattern elsewhere, a simple program to view the locale layout, by printing which locale owns each element, is:

use BlockDist;

config const n = 10;

var A = blockDist.createArray({1..n, 1..n}, int);
[a in A] a = here.id;  // each element is set to the ID of the locale that owns it
writeln(A);
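
For example, when run on 4 locales (which the heuristic reshapes into a 2 x 2 grid), the output should look something like:

0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 1
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3
2 2 2 2 2 3 3 3 3 3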

-Brad


Thanks for the prompt replies!
Based on my understanding, the current heuristic will work well if the n-D array is created with equal sizes in all dimensions, but may not work well if the n-D array itself is unbalanced.

I think the auto-reshape strategy should be as follows. Suppose you have 6 locales:

  • If the 2D array is 200x300, then reshaping the 6 locales into 2x3 is best, so that each locale gets a 100x100 block;
  • If the 2D array is 300x200, then reshaping the 6 locales into 3x2 is best, so that each locale gets a 100x100 block;
  • If the 2D array is 100x600, then reshaping the 6 locales into 1x6 is best, so that each locale gets a 100x100 block;
  • If the 2D array is 600x100, then reshaping the 6 locales into 6x1 is best, so that each locale gets a 100x100 block;

However, I find that the current heuristic https://github.com/chapel-lang/chapel/blob/1dd8dcaa17/modules/dists/DSIUtil.chpl#L184 aims to factorize the number of locales into factors that are as equal as possible, without taking the array's domain into account. If I understand correctly, the current heuristic will factorize 6 into 3x2 every time, for any 2D array (the code at the end of the function _factor makes sure that the blocking factors are in descending order, i.e., 3x2 is used instead of 2x3).

@Anjiang: That's an interesting idea. I think some potential downsides are:

(a) it assumes that the surface::volume ratio is more important to optimize for than long strips of consecutive array elements, which may be true for some programs but not for others.

(b) for an array whose bounds change from run to run, the locale topology would also change from run to run, which could lead to performance surprises for users who weren't expecting it. Maybe those would be welcome surprises, but maybe not.

Of course, nothing prevents a user or package library from implementing such a heuristic as a helper routine and invoking that helper when creating new block distribution instances (e.g., targetLocales=myFavoriteHeuristic()).
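
For the 2D case, such a helper might look something like the sketch below. This is just an illustration, not library code; it assumes the optional targetLocales argument to blockDist.createArray is available in your version, and it picks whichever factorization of numLocales gives per-locale blocks closest to square:

use BlockDist;

proc myFavoriteHeuristic(dom: domain(2)) {
  const (nRows, nCols) = dom.shape;
  var best = (numLocales, 1);
  var bestScore = max(real);
  // try every factorization r x c of numLocales and keep the one whose
  // per-locale blocks are closest to square for this domain
  for r in 1..numLocales do
    if numLocales % r == 0 {
      const c = numLocales / r;
      const score = abs((nRows:real) / r - (nCols:real) / c);
      if score < bestScore then
        (bestScore, best) = (score, (r, c));
    }
  return reshape(Locales, {1..best(0), 1..best(1)});
}

const Dom = {1..200, 1..300};
// with 6 locales, this targets a 2 x 3 locale grid, giving 100 x 100 blocks
var A = blockDist.createArray(Dom, int, targetLocales=myFavoriteHeuristic(Dom));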

Personally, I favor keeping the current behavior for its slightly simpler nature and better consistency. But if we had performance results or key user codes showing that this was a big win, that would definitely be good motivation to consider it further. A challenge to gathering such results would be to also ensure that it didn't particularly hurt other key cases.

-Brad


Thanks for the explanations!

I have a new question:
I guess locales are 1D by default. For a multi-node, multicore CPU scenario, I would like each CPU core (i.e., locale) to be identified by a node ID and a processor ID, which is 2D. How can this topology be reflected when reshaping the locales? E.g., with 2 nodes and an 8-core CPU per node, I assume the first 8 cores come from one node and the last 8 cores from the other node, so we could reshape the locales into 2x8? Or maybe 8x2? (Does reshape have a clear semantics here?)

Also, does the above analysis apply to a multi-node, multi-GPU scenario (where one node has multiple GPUs)?

Hi @Anjiang

Historically, Chapel's preferred model has been to run a single locale (process) per compute node. So in your scenario, we'd think of that as a 2-locale system, where each locale was capable of running 8 tasks in parallel, one per core. We've recently added a capability called co-locales, which is designed to support multiple locales (processes) per node, but still at a coarse granularity, like 1 process per NIC or 1 process per socket. Generally speaking, it isn't good practice to run a Chapel program with a locale per core (and our co-locale support doesn't currently permit that).

When GPUs come into play, we similarly don't think of those as separate/sibling locales, but rather as sub-locales of the top-level locales representing the nodes (or NIC/sockets in a co-locale model).
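
As a quick way to see that structure from a program (just a sketch; the gpus query assumes a GPU-enabled build, i.e., CHPL_LOCALE_MODEL=gpu), something like the following prints what each top-level locale contains:

// cores show up as parallelism within a locale; GPUs show up as sub-locales
coforall loc in Locales do on loc {
  writeln(here.name, ": ", here.maxTaskPar, " cores available for tasks, ",
          here.gpus.size, " GPU sublocale(s)");
}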

For more on co-locales, see Multilocale Chapel Execution — Chapel Documentation 1.32

For more on using GPUs within Chapel, see GPU Programming — Chapel Documentation 1.32

-Brad
