[chapel-lang/chapel] Improve LLVM vectorization support

Branch: refs/heads/master
Revision: 3f5a5fc
Author: mppf
Log Message:

Merge pull request #16152 from mppf/vec-fixes

Improve LLVM vectorization support

Resolves #11636.

This PR takes several steps to improve LLVM vectorization hinting
support:

  • for LLVM vectorization hinting, use llvm.loop.parallel_accesses and
    llvm.access.group instead of the deprecated
    llvm.mem.parallel_loop_access.
  • for the RV vectorization hint, use a separate notion of
    vectorization hazards (distinct from the hazards for the LLVM hint)
  • use a new strategy instead of markVectorizationHazards to check for
    patterns that the LLVM loop vectorizer does not support. Previously,
    the compiler tried to guess which patterns would end up with
    problematic allocas for loop-local variables. Now it marks with
    llvm.access.group only accesses to variables that are non-stack or
    are declared outside of any order-independent loop. This way, if the
    loop-local stack variables are converted to registers as expected,
    the loop will be parallel as far as LLVM is concerned.
  • change the vectorization hinting strategy for iterators. Once
    vectorization was enabled, I observed it being applied to cases
    where it should not be. Yielding loops in follower/standalone
    iterators are no longer marked order independent. Instead, these
    are expected to be marked by the user or by an earlier part of
    compilation. For now there is a pragma to do that. In the future we
    expect that it will be up to users to indicate this; this comment
    discusses possible syntax approaches.
    Also, the compiler knows how to infer order independence in the very
    common case of for x in something() do yield x. Because an unmarked
    iterator might cause performance surprises, serial, standalone, and
    follower iterators in the standard/internal/package modules must
    either have inferred order independence or use a pragma to indicate
    whether the yielding for loop is order independent.
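
For reference, the metadata scheme named in the first bullet looks
roughly like the sketch below, which follows the pattern described in
the LLVM Language Reference. The function @scale, its body, and the
metadata numbering are illustrative assumptions, not output from the
Chapel compiler, and the opaque ptr syntax assumes a recent LLVM
(15 or later).

```llvm
; Hypothetical loop that doubles each element of an i32 array.
define void @scale(ptr %a, i64 %n) {
entry:
  %empty = icmp sle i64 %n, 0
  br i1 %empty, label %exit, label %body

body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %body ]
  %p = getelementptr inbounds i32, ptr %a, i64 %i
  ; Each access that is safe to treat as parallel names access group !1.
  %v = load i32, ptr %p, align 4, !llvm.access.group !1
  %d = shl i32 %v, 1
  store i32 %d, ptr %p, align 4, !llvm.access.group !1
  %i.next = add nuw nsw i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  ; The loop's back edge carries the loop metadata !0.
  br i1 %done, label %exit, label %body, !llvm.loop !0

exit:
  ret void
}

; The loop declares that accesses in group !1 carry no loop-carried
; dependences within this loop.
!0 = distinct !{!0, !{!"llvm.loop.parallel_accesses", !1}}
; The access group itself is an empty distinct node.
!1 = distinct !{}
```

An access without !llvm.access.group (for example, to an alloca that was
not promoted to a register) is not covered by the hint, so the
vectorizer must analyze it conservatively.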

Reviewed by @e-kayrakli with input from @slnguyen - thanks!

  • [x] full local --verify testing
  • [x] full local --llvm testing
  • [x] full local --llvm --fast testing

Future work:

  • remove CHPL_PRAGMA_IVDEP entirely and --force-vectorization
    hinting for the C backend.

Modified Files:
A test/llvm/parallel_loop_access/different_numbers.compopts
A test/llvm/parallel_loop_access/generation_inside_loop.compopts
A test/llvm/parallel_loop_access/no_parallel_loop_accesses.chpl
A test/llvm/parallel_loop_access/no_parallel_loop_accesses.compopts
A test/llvm/parallel_loop_access/no_parallel_loop_accesses.good
A test/llvm/parallel_loop_access/parallel_loop_accesses1.chpl
A test/llvm/parallel_loop_access/parallel_loop_accesses1.compopts
A test/llvm/parallel_loop_access/parallel_loop_accesses1.good
A test/llvm/parallel_loop_access/parallel_loop_accesses2.chpl
A test/llvm/parallel_loop_access/parallel_loop_accesses2.compopts
A test/llvm/parallel_loop_access/parallel_loop_accesses2.good
A test/llvm/parallel_loop_access/parallel_loop_accesses3.chpl
A test/llvm/parallel_loop_access/parallel_loop_accesses3.compopts
A test/llvm/parallel_loop_access/parallel_loop_accesses3.good
A test/llvm/parallel_loop_access/parallel_loop_accesses3.noexec
A test/llvm/parallel_loop_access/simple_forall.compopts
A test/llvm/parallel_loop_access/simple_loop.compopts
A test/llvm/parallel_loop_access/zippered_forall.compopts
A test/llvm/vectorization/parallel_loop_accesses/PREDIFF
A test/llvm/vectorization/parallel_loop_accesses/SKIPIF
A test/performance/vectorization/vectorizeOnly/iterator-loop-vectorization.chpl
A test/performance/vectorization/vectorizeOnly/iterator-loop-vectorization.compopts
A test/performance/vectorization/vectorizeOnly/iterator-loop-vectorization.good
A test/performance/vectorization/vectorizeOnly/iterator-loop-vectorization.prediff
A test/performance/vectorization/vectorizeYieldingLoopsPragma/vectorZipForLoop.bad
A test/performance/vectorization/vectorizeYieldingLoopsPragma/vectorZipForLoop.future
R test/llvm/parallel_loop_access/COMPOPTS
M compiler/AST/LoopExpr.cpp
M compiler/AST/LoopStmt.cpp
M compiler/AST/foralls.cpp
M compiler/AST/type.cpp
M compiler/codegen/CForLoop.cpp
M compiler/codegen/DoWhileStmt.cpp
M compiler/codegen/LoopStmt.cpp
M compiler/codegen/WhileDoStmt.cpp
M compiler/codegen/expr.cpp
M compiler/codegen/symbol.cpp
M compiler/include/LoopStmt.h
M compiler/include/codegen.h
M compiler/include/flags_list.h
M compiler/include/genret.h
M compiler/include/type.h
M compiler/resolution/lowerIterators.cpp
M compiler/resolution/resolveFunction.cpp
M modules/dists/BlockCycDist.chpl
M modules/dists/BlockDist.chpl
M modules/dists/CyclicDist.chpl
M modules/dists/DimensionalDist2D.chpl
M modules/dists/HashedDist.chpl
M modules/dists/PrivateDist.chpl
M modules/dists/SparseBlockDist.chpl
M modules/dists/StencilDist.chpl
M modules/dists/dims/BlockCycDim.chpl
M modules/dists/dims/BlockDim.chpl
M modules/dists/dims/ReplicatedDim.chpl
M modules/internal/ArrayViewRankChange.chpl
M modules/internal/ArrayViewReindex.chpl
M modules/internal/ArrayViewSlice.chpl
M modules/internal/Bytes.chpl
M modules/internal/BytesStringCommon.chpl
M modules/internal/ChapelArray.chpl
M modules/internal/ChapelError.chpl
M modules/internal/ChapelHashtable.chpl
M modules/internal/ChapelIteratorSupport.chpl
M modules/internal/ChapelLocale.chpl
M modules/internal/ChapelRange.chpl
M modules/internal/ChapelTuple.chpl
M modules/internal/DefaultAssociative.chpl
M modules/internal/DefaultRectangular.chpl
M modules/internal/DefaultSparse.chpl
M modules/internal/String.chpl
M modules/layouts/LayoutCS.chpl
M modules/packages/DistributedBag.chpl
M modules/packages/DistributedDeque.chpl
M modules/packages/EpochManager.chpl
M modules/packages/FunctionalOperations.chpl
M modules/packages/LockFreeQueue.chpl
M modules/packages/LockFreeStack.chpl
M modules/packages/OrderedSet/Treap.chpl
M modules/packages/RangeChunk.chpl
M modules/packages/RecordParser.chpl
M modules/packages/Sort.chpl
M modules/packages/VisualDebug.chpl
M modules/standard/FileSystem.chpl
M modules/standard/Heap.chpl
M modules/standard/IO.chpl
M modules/standard/LinkedLists.chpl
M modules/standard/List.chpl
M modules/standard/Map.chpl
M modules/standard/Random.chpl
M modules/standard/Regexp.chpl
M modules/standard/Set.chpl
M modules/standard/Types.chpl
M test/llvm/parallel_loop_access/different_numbers.chpl
M test/llvm/parallel_loop_access/generation_inside_loop.chpl
M test/llvm/parallel_loop_access/simple_forall.chpl
M test/llvm/parallel_loop_access/simple_loop.chpl
M test/llvm/parallel_loop_access/zippered_forall.chpl
M test/performance/vectorization/hintTests/vec-hint-ok-harder-zips.good
M test/performance/vectorization/hintTests/vec-hint-ok-harder.good
M test/performance/vectorization/hintTests/vec-hint-ok.good
M test/performance/vectorization/hintTests/vec-no-hint.chpl
M test/performance/vectorization/hintTests/vec-no-hint.good
M test/performance/vectorization/vectorPragmas/basicIters.compgood
M test/performance/vectorization/vectorPragmas/cForLoopInParIter.compgood
M test/performance/vectorization/vectorPragmas/doWhileInParIter.chpl
M test/performance/vectorization/vectorPragmas/doWhileInParIter.compgood
M test/performance/vectorization/vectorPragmas/forallInStandalone.compgood
M test/performance/vectorization/vectorPragmas/loopWithoutYield.compgood
M test/performance/vectorization/vectorPragmas/loopsInForallNoVector.compgood
M test/performance/vectorization/vectorPragmas/nestedLoopsInFollower.compgood
M test/performance/vectorization/vectorPragmas/nonInlinableFollower.compgood
M test/performance/vectorization/vectorPragmas/whileDoInParIter.chpl
M test/performance/vectorization/vectorPragmas/whileDoInParIter.compgood
M test/performance/vectorization/vectorPragmas/zipIterInFollower.chpl
M test/performance/vectorization/vectorPragmas/zipIterInFollower.compgood
M test/performance/vectorization/vectorizeOnly/vectorizeOnlyEmitsVectorPragma.good
M test/performance/vectorization/vectorizeYieldingLoopsPragma/vectorDoWhileLoop.good
M test/performance/vectorization/vectorizeYieldingLoopsPragma/vectorForLoop.good
M test/performance/vectorization/vectorizeYieldingLoopsPragma/vectorWhileLoop.good
M test/performance/vectorization/vectorizeYieldingLoopsPragma/vectorZipForLoop.good

Compare: https://github.com/chapel-lang/chapel/compare/5532ddf30ba7...3f5a5fc1586d