External Issue: Improve auto-vectorization of 'for' loops

20841, "milthorpe", "Improve auto-vectorization of 'for' loops", "2022-10-13T05:14:18Z"

Chapel has the foreach construct to indicate a loop where the iterations are independent and may therefore be reordered or vectorized. However, users coming from other imperative languages are likely at first to reach for the for loop, which specifies ordered (serial) iteration. The Chapel compiler (with its llvm backend) doesn't always vectorize for loops, which may lead to a performance surprise when what appears to be a simple loop runs significantly slower for Chapel than it does for an equivalent loop in another language.

For example, the following loops in Chapel and C are intended to be equivalent:

// Chapel
proc assign3D(a: [?adom] real(32)) {
  for k in adom.dim(2) {
    a[0,0,k] = 1.0;
    a[0,1,k] = 1.0;
    a[1,0,k] = 1.0;
    a[1,1,k] = 1.0;
    a[2,0,k] = 1.0;
    a[2,1,k] = 1.0;
  }
}
// C
void assign3D(float a[3][2][N]) {
  for (size_t k=0; k<N; k++) {
    a[0][0][k] = 1.0;
    a[0][1][k] = 1.0;
    a[1][0][k] = 1.0;
    a[1][1][k] = 1.0;
    a[2][0][k] = 1.0;
    a[2][1][k] = 1.0;
  }
}

However, when compiled as follows, the C loop vectorizes, whereas the Chapel loop does not (detailed explanation below):

$ chpl --fast -g simple.chpl --mllvm "--pass-remarks=loop-vectorize --pass-remarks-analysis=loop-vectorize"
...
remark: assign.chpl:2:0: loop not vectorized: cannot prove it is safe to reorder memory operations

$ clang -g -Ofast -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize assign.c
assign.c:5:3: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-analysis=loop-vectorize]
  for (size_t k=0; k<N; k++) {
  ^
assign.c:5:3: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]

It would be nice if the Chapel compiler could provide the necessary information to the llvm backend to allow auto-vectorization of simple loops, even when they're not explicitly marked as vectorizable via foreach.

Alternatively, it could be worthwhile to update the documentation to recommend choosing foreach where the order of iterations does not matter.

Detailed explanation

The reason the Chapel loop doesn't vectorize is that the llvm LoopVectorizer assumes that any of the six memory locations written to in the body of the loop may be an alias for any of the others, so it needs to add runtime checks against this possibility in order to vectorize. The vectorizer weighs the cost of the added runtime checks (15 pairwise checks between 6 locations) against the benefit of vectorization, and decides that it's not profitable to vectorize.

The LoopVectorizer uses a configurable VectorizeMemoryCheckThreshold, so we can get the loop to vectorize by increasing the threshold:

chpl --fast -g simple.chpl --mllvm "--pass-remarks=loop-vectorize --pass-remarks-analysis=loop-vectorize --runtime-memory-check-threshold=15"
...
remark: assign.chpl:2:0: vectorized loop (vectorization width: 8, interleaved count: 4)

However, this isn't really an ideal fix - llvm is still adding the runtime memory checks, so Chapel will still be less efficient than C in this case. I suspect that by providing the correct aliasing hints to llvm, the checks can be eliminated.

(If the Chapel loop is changed from for to foreach, the loop vectorizes correctly.)