New Issue: how to mark loops as order independent without adding qthreads-tasks

16404, “mppf”, “how to mark loops as order independent without adding qthreads-tasks”, “2020-09-17T19:05:17Z”

Split off from a discussion in #7761 driven by #16152.

A forall loop is order independent (by definition, according to the spec). But the iterator implementing a forall statement isn’t necessarily order independent. For example, a follower iterator might include a cursor pattern (this comes up today for follower iterators in the Random module):

// imagine serial/standalone/leader iterators too
iter myiter(param tag: iterKind, followThis) ref
where tag == iterKind.follower {
  var cursor: real;
  for i in _value.these(tag, followThis) {
    cursor = nextcursor(cursor);
     yield (i, cursor);
  }
}

Order independence is important in order to vectorize and is related to how we imagine we will support Chapel-native GPU code generation.

So, the question is, how can order independence (or not) be indicated in a follower iterator?

See TODO strategy for vectorization and GPU support for some design choices leading here.

This issue proposes a new way to indicate that a loop is order independent (besides vectorizeOnly) without creating qthreads-style tasks. This will be suitable to use in follower iterators but could also be used elsewhere. It is important that it works with forall-intents (in, var, and reduce) so that, for example, follower iterators can explicitly handle vectorization scenarios.

This issue will discuss several alternatives for exact syntax for this new strategy.

Design Direction

for loops today

for loops are serial in that they don’t create any qthreads-style tasks. However, vectorization might occur as on optimization. In other words, vectorization will occur only when the compiler can prove it is safe to do so (i.e. it won’t change the correctness of the program).

A for loop will use a serial iterator.

forall loops today

A forall loops always indicates that the loop body could be run in parallel - but isn’t necessarily run in parallel. So, use of the forall keyword indicates that the loop is both order independent and serializable. (See e.g. https://chapel-lang.org/docs/master/language/spec/data-parallelism.html#execution-and-serializability). Vectorization on the inner loops will be hinted to LLVM if the iterators implementing the forall are order independent (after PR #16152).

A forall loop will use a standalone or leader/follower iterators - depending on if zippering occurs.

Follower iterators and vectorization

So, back to the question faced at the outset about order independence in follower iterators. One viewpoint is that a follower iterator should be using forall if it is writing an order independent loop. The problem with doing that is that forall generally means creating qthreads-style tasks (although this is up to the parallel iterator invoked). However, these follower iterators could use a forall that does not create qthreads-style tasks.

However, no matter how an order-independent loop is expressed in the follower, it is important that the compiler can inline this pattern.

vectorizeOnly

Currently we have vectorizeOnly as a way to request vectorization. This currently works with for and forall loops. However, in the context of considering vector lanes to be tasks, this is an odd state to be in:

  • for does not indicate order-independence but order-independence is a precondition to indicating vectorize-ability. In other words, why should for + vectorizeOnly indicate order independence when elswhere it is forall that does so?
  • for with vectorizeOnly does not allow for per-vector-lane variables or reductions the way that forall intents does
  • forall + vectorizeOnly is redundant, because almost all forall loops are already vectorizable. The main difference is that it does not create qthreads-style tasks, but we need a way to write that already for the follower loops.
  • vectorizeOnly forces the compiler think of the loop as order independent rather than allowing the called iterator to indicate if it is order independent itself.

Could vectorizeOnly be used to mark order-independent loops in follower iterators? Not as vectorizeOnly currently functions:

  • Today, a vectorizeOnly iterator forces vectorize-ability; so, for myiter() would be vectorized if myiter contained vectorizeOnly
  • the follower iterator might use another iterator that itself might not be order independent. If the follower calls another iterator, like order-independent-for someOtherIterator() then the resulting iterator should only be order independent if someOtherIterator was marked so. But vectorizeOnly is more of an assertion and it doesn’t care about the iterator it’s wrapping - for vectorizeOnly(someOtherIterator()) would make that loop order-independent regardless of what someOtherIterator did.

This issue proposes that vectorizeOnly be deprecated and code using it instead would use a new mechanism.

supporting in var and reduce intents

The no-qthreads order-independent loop syntax needs to support also similar features to forall intents. The reason for this is that the order-independent loop can be vectorized and in that event has the potential to have parallel lock-step tasks implementing it. In order to have those tasks communicate safely in the common cases, they will need reduce intents and also general features along the lines of TODO see task numbering.

which iterator is invoked?

Some brainstorming ideas in this direction brought up in #7761 include forall(serial) and adding foreach as a means to do indicate order independence without creating qthreads-style tasks. But, these directions is that they have a hidden challenge. In particular - which iterator would these call? Would we need to have a different leader/follower/standalone for the no-tasks-but-vectorize case?

It would be simplest, in terms of compiler implementation, if the new no-qthreads order-independent loop mechanism called the serial iterator. However, it might be more intuitive if they call a parallel iterator and squash the parallelism.

An additional wrinkle is that a follower iterator can’t generally use coforall or forall in a yielding loop (or else we get “invalid use of parallel construct in serial iterator”). The compilation errors for this case would need to be tightened.

Note that even if the serial iterator is called in this order-independent case - the serial iterator can still adapt to vectorization by itself using the new mechanism along with reduce intents (say).

A further option would be for the new no-qthreads order-independent loop mechanism to call a new iterator overload, if available. Adding such a feature in the future would not be breaking as long as it continues to fall back on whatever we start with. (E.g., if we start with the new mechanism calling the serial iterator; it will still do that as long as the new iterator overload is not present, so won’t be backwards-breaking).

Proposals

Proposal 1: forall and loop configuration

This proposal is to add functionality to the forall loop to avoid creating any qthreads-style tasks and then use this functionality with forall to indicate an order-independent loop in a follower iterator. This approach makes it clear that these order-independent loops can use forall intents to create create per-vector-lane variables or handle reductions across vector lanes.

We could imagine adding an argument to parallel iterators indicating the number of tasks to create, or an argument indicating not to create any tasks at all. However, the number of tasks created by a parallel iterator is something that almost every parallel iterator already has to consider. As a result, it seems that such an argument would add more boiler-plate code to writing parallel iterators.

This issue proposes adding an idea of loop configuration to a forall loop. The loop configuration will be available to both iterator code being called as well as to the loop body.

This issue further proposes that the forallwith syntax be able to supply the loop configuration. For a first straw-person example:

forall x in myObject with(orderIndependentOnly=true) { }

What is the Type of a Loop Configuration?

Note that the loop configuration needs to be something that one iterator can pass to another iterator (as it is very common for iterators to call each other).

If #16388 were addressed, we could consider using a tuple with named elements for the loop configuration.
In the meantime this issue will propose that the loop configuration is something record-like (but that is not a normal record because of the way the compiler needs to work with it).

It might also be useful to be able to store a loop configuration in a variable in the event that there became quite a few settings. E.g.

var loopConfig = new loopConfig(orderIndependentOnly=true);
forall x in myObject with(loopConfig) { }

However this is not necessary to support at first, as long as the loop configuration can be passed between iterators.

What information is in the Loop Configuration?

I expect that the loop configuration will include the following information:

  • how many qthreads-style tasks to create, if any at all
  • the shape of the overall forall - when it is known
  • GPU kernel launch information - most commonly the warp size

How does an iterator implementation use the loop configuration?

A follower iterator wanting to indicate it is order independent would use the forall loop with the loop configuration that prevents launching qthreads-style tasks:

proc myiter(... tag=follower, followThis) {
  forall i in 1..n with (orderIndependentOnly=true) {
     yield f(i);
  }
}

The compiler would arrange for forall i in 1..n with (orderIndependentOnly=true) to call the range serial iterator.

A leader iterator would respond to the loop configuration when choosing the number of tasks to create. Instead of this:

proc myiter(... tag=leader) {
  const numChunks = _computeNumChunks(len);
  coforall chunk in 0..#numChunks {
    ...
  }
}

it would be this:

proc myiter(... tag=leader) {
  const numChunks = _computeNumChunks(Forall.configuration, len);
  coforall chunk in 0..#numChunks {
    ...
  }
}

and _computeNumChunks would access a target number of tasks to create from within the loop configuration.

Why not use for loops to express vectorization?

For loops don’t support the forall-intents like reduce, in, and var. These will be important for writing vectorized code. Additionally, as we expose SIMT features (see SIMT and SIMD above), these will only be available within forall loops.

Why not an annotation?

One might imagine that instead of writing

forall x in myObject with(vectorizeOnly=true) { }

one could write

@vectorizeOnly
forall x in myObject { }

While that would be OK for things like flags, other loop configuration - such as the number of qthreads style tasks to create - will depend on runtime values. It would be strange to have something like this:

proc f(nTasks: int) {
  @coforallTasks(nTasks)
  forall x in myObject { }
}

Proposal 2: foreach

This proposal adds a new kind of loop, a foreach loop, that is order independent but does not create multiple qthreads-style tasks. The foreach loop supports similar intents to forall loops.

This proposal is inspired by this comment: https://github.com/chapel-lang/chapel/issues/7761#issuecomment-690670109

For example, this would be an order-independent loop without creating qthreads-tasks:

foreach x in myObject { }

How does an iterator implementation use the loop configuration?

A follower iterator wanting to indicate it is order independent would use the foreach loop:

proc myiter(... tag=follower, followThis) {
  foreach i in 1..n {
     yield f(i);
  }
}

The compiler would arrange for foreach i in 1..n to call the range serial iterator.

forall-like intents

It would be possible to use something like forall intents with foreach, e.g.:

// does a vectorization-safe reduction
foreach i in 1..10 with (+ reduce x) { ... }

// copies `a` to a local variable for each vector lane
foreach i in 1..10 with (in a) { ... }

// creates a `b` variable for each vector lane
foreach i in 1..10 with (var b: int) { ... }