New Issue: strategy for vectorization and GPU support

16403, “mppf”, “strategy for vectorization and GPU support”, “2020-09-17T19:04:27Z”

This issue discusses a vision that forall loops in user code will be translated into vectorized loop bodies and GPU kernels.

Order independence is a prerequisite for vectorization and is related to how we imagine we will support Chapel-native GPU code generation. This issue will discuss a strategy for supporting vectorization and native GPU code generation with order-independent loops. forall loops are already order independent (see e.g. https://chapel-lang.org/docs/master/language/spec/data-parallelism.html#execution-and-serializability).
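
For example, because forall iterations are order independent, the compiler is free to execute a loop like this minimal sketch across vector lanes or as a GPU kernel:

    var A: [1..100] real;
    forall i in 1..100 do
      A[i] = i * 2.0;   // iterations may legally run in any order, so they
                        // can map to vector lanes or SIMT threads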

See TODO (another issue) for a discussion of how forall loops that don't create qthreads-style tasks might be marked as order independent.

SIMT and SIMD

We are working towards supporting GPU programming within Chapel. A GPU has some similarities to a CPU with an advanced vector unit. In particular, once a SIMD-style instruction set is extended to support masking out vector lanes (to handle branches in the vectorized code) and to support gather and scatter (to handle loads and stores in the vectorized code), it is not so far off from a SIMT platform like a GPU. These hardware devices exist on a sort of continuum, where something like AVX-512 is actually pretty close to a GPU in terms of the programming model the hardware supports. (That does not mean that the typical way to program AVX-512 is SIMT, however.)

In a GPU, each SIMT task is programmed with a separate thread of control. This is different from SIMD-style programming, where one programs the vector lanes collectively.

Additionally, it is possible to start from a SIMT-style programming model and then convert it to SIMD. That is what RV (the Region Vectorizer) and ISPC (the Intel® Implicit SPMD Program Compiler) do. However, it is not clear whether the reverse, translating a SIMD program into a SIMT one, is possible.

Since Chapel's goals include both global-view programming and multi-resolution design, this issue proposes that Chapel support both global-view programming for vectorization/GPU support and SIMT-style programming for programmers wishing to get deeper into the details.

Part of the global-view programming for vectorization includes supporting features like task-local variables and reductions in a way that does not inherently inhibit vectorization. While there is some implementation work to achieve this, it also has language design implications. In particular, a Chapel programmer will need to be able to use language features to request vector-lane-local variables or reductions across vector lanes. (Otherwise, we are stuck with having compiler analysis identify reductions within serial for loops; this is what C compilers do, but it doesn't make sense for a language that has reductions as a first-class feature.)
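
As a minimal sketch of the global-view side: the reduce intent below already expresses the reduction at the language level, so under this proposal the compiler could implement it with per-vector-lane accumulators that are combined after the loop, rather than relying on analysis of a serial loop:

    var A: [1..1000] real = 1.0;
    var sum = 0.0;
    forall a in A with (+ reduce sum) do
      sum += a;     // each task/vector lane accumulates privately
    writeln(sum);   // partial results are combined, printing 1000.0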

What is a task?

The above considerations about supporting SIMT programming (for eventual GPU support and for CPU vectorization in a unified manner), together with the considerations about reductions in global-view forall loops not inhibiting vectorization, lead to the idea that Chapel programmers will need to be aware of vectorization as creating a kind of concurrency in their loops.

Currently, the Chapel language specification uses the term task to talk about work that can happen concurrently. While we could adjust the language specification to consider tasks and vector lanes as separate ideas, this seems to go against Chapel's goal of providing a unified view of parallel programming across many different types of systems.

As a result, this issue proposes that the term task will include a vector lane or SIMT thread in a GPU. These are limited forms of tasks (they can’t wait on each other in the same way that a qthreads task can) but they are tasks nonetheless in that they represent operations that can occur in parallel with operations from other tasks.

However, this brings up a challenge: we will need terminology that distinguishes a qthreads-style task from a vector lane. It might make sense to have two different terms; a starting point might be task and lockstep task, or something along those lines.

The Two Types of Vectorization

There are actually two quite different types of vectorization that will be supported by the Chapel compiler. First, there is automatic vectorization, which can occur for any loop where the compiler can prove it will not impact correctness. This kind of vectorization is reasonably well supported by C compilers and by the stock LLVM loop optimization passes. It generally applies to for loops, including those that come about from iterator inlining. In this issue, we will refer to this type of vectorization as automatic vectorization.
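
For example, a serial for loop like the following sketch is a typical candidate for automatic vectorization, provided the compiler can prove the iterations are independent (e.g., that the arrays do not alias):

    const n = 100;
    var A, B, C: [1..n] real;
    for i in 1..n do
      C[i] = A[i] + B[i];   // independent iterations; vectorizable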

The second type of vectorization is different in that it occurs with programmer input. In particular, with this second type of vectorization, the programmer should be able to:

  • use forall intents like reduce, in, and var to create per-vector-lane operations (supporting global-view programming that is vectorized)
  • eventually, use lower-level GPU programming features that require a SIMT model (such as local data store / shared memory). This will require being able to identify the current SIMT thread number within a loop. (supporting SIMT programming for multi-resolution design)

We can call this second type of vectorization forall vectorization.
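
A minimal sketch combining these intents: each in and var intent creates a per-task copy, which under forall vectorization would become a per-vector-lane copy. (The mechanism for querying the current SIMT thread number is future work and is not shown.)

    var scale = 2.0;
    var total = 0.0;
    forall i in 1..100 with (in scale,        // per-lane copy of scale
                             var tmp: real,   // per-lane temporary
                             + reduce total) {
      tmp = i * scale;
      total += tmp;
    }
    writeln(total);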

The serial statement and vectorization

Since the serial statement prevents multiple tasks from running, it might be reasonable for it to also prevent forall vectorization. In particular, one might want to avoid a GPU kernel launch by using serial. serial would not prevent automatic vectorization in any event. This topic has its own issue - #11798. Note that if serial prevents forall vectorization, that might lead to a requirement for code cloning, since a function can be called both inside and outside of a serial block.
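
A minimal sketch of that usage, assuming serial does come to prevent forall vectorization:

    var A: [1..100] int;
    serial {
      // Under this proposal, this forall would not launch a GPU kernel
      // or use forall vectorization; automatic vectorization of the
      // resulting serial loop could still occur.
      forall i in 1..100 do
        A[i] = i;
    }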

coforall and vectorization

Since coforall creates tasks, and tasks can include vector lanes, is there a potential that coforall will only create vector parallelism and not qthreads-style tasks? That might become possible with compiler analysis and optimization, but the language design is that coforall creates general tasks (which can do things like wait for each other). That means that, for the near term, coforall will continue to create qthreads-style tasks.
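
For instance, coforall tasks can synchronize with one another, which lockstep vector lanes cannot; here is a minimal sketch using a sync variable:

    var token: sync int;
    coforall tid in 1..2 {
      if tid == 1 then
        token.writeEF(42);          // task 1 produces a value
      else
        writeln(token.readFE());    // task 2 blocks until the value arrives
    }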

If we defined coforall to create a combination of vector lanes and qthreads-style tasks, it would be potentially confusing that two very different types of parallelism are created by the same construct.

leader/follower iterators and vectorization

We could consider having the leader iterator (somehow) create vector parallelism and having the follower iterator operate entirely within a single vector lane. This approach has a few advantages:

  1. The leader creates all parallelism (including vector parallelism) and the follower is entirely sequential.
  2. It can be relied upon more consistently. With zippered iteration, it might be the case that one of several follower loops cannot be marked order independent. That does not present a problem if the leader iterator vectorizes, since each follower iterator will simply be run once per vector lane.

This approach has several drawbacks:

  1. It obfuscates the relationship between the loop being vectorized and the vectorization itself.
  2. It generally implies using strided memory access in order to produce an access pattern that can be coalesced, but CPU vectorization strategies probably won't handle strided access well.
  3. It should be up to the end forall and/or the compiler, not the iterator, to decide how many vector lanes to use. Putting the vectorization in the leader seems to force the leader to know the number of vector lanes.
  4. For programmers, with straightforward memory accesses like those in STREAM, the memory access order has to be strided in the leader-vectorizes case, whereas in the follower-vectorizes case it is the natural order.
  5. In the event that one of the follower loops is not order independent, the follower could be modified to make it order independent using reduce intents and TODO link to task ids; or the compiler could use a different strategy when inlining the followers that runs each non-order-independent follower in its own vector lane.

As a result of these practical drawbacks, this issue proposes the direction that follower iterators indicate vectorizability / the ability to run in a GPU kernel by indicating order independence.
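
To illustrate the proposed direction, here is a minimal sketch of a parallel iterator whose follower marks its yields as order independent. The foreach loop, Chapel's order-independent serial loop, is used here as an illustrative marker (the precise marking mechanism is part of what this issue explores); vecIter and its simple forwarding to the range's iterators are assumptions for the example:

    iter vecIter(r: range) {
      for i in r do yield i;
    }

    iter vecIter(param tag: iterKind, r: range)
        where tag == iterKind.leader {
      // forward to the range's leader to create qthreads-style tasks
      for followThis in r.these(iterKind.leader) do
        yield followThis;
    }

    iter vecIter(param tag: iterKind, followThis, r: range)
        where tag == iterKind.follower {
      // order-independent yields signal that this loop is eligible for
      // forall vectorization / execution in a GPU kernel
      foreach i in r.these(iterKind.follower, followThis) do
        yield i;
    }

    var A: [1..8] int;
    forall i in vecIter(1..8) do
      A[i] = i;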