[Chapel Merge] Automatically aggregate some communication in fora

Branch: refs/heads/master
Revision: a7d4360
Author: e-kayrakli
Log Message:

Merge pull request #16965 from e-kayrakli/auto-agg

Automatically aggregate some communication in foralls

This PR makes the compiler analyze and transform the AST to use communication
aggregation within foralls. At a high-level this is done by bringing 3 already
existing optimizations together:

  • Module-level aggregators: @ronawho has been working on them for some time,
    and they are used by Arkouda, and within the test/studies/bale on Chapel
    repo. Standalone, they can be used as TPVs in foralls, and aggregated
    assignments can be replaced by copy method calls on aggregators.

  • Unordered forall optimizations: @mppf has implemented this optimization,
    where the assignments in forall bodies can be replaced by unordered
    assignments if they are the last statements. This results in non-blocking
    communication for those operations. This doesn’t require any code
    modifications.

  • Automatic local access: This analyzes forall “signatures” and bodies to use
    localAccess instead of this in some array accesses where the compiler can
    determine that the array’s distribution and forall’s domain are aligned. User
    doesn’t need to modify the code to benefit from this.

This PR analyzes the last statements of foralls and makes them use aggregation
if:

  • they are assignments,
  • one side is provably local, where the other is likely not,
  • the assignment is safe to be unordered.

The optimization is off-by-default and can be turned on with
--auto-aggregation. --report-auto-aggregation enables some reporting during
compilation. -sverboseAggregation enables the module-level output at execution
time.

How it works

At a very high level, if you have something like:

var dom = newBlockDom({1..10});

var a: [dom] int;
var b: [dom] int;

forall i in a.domain {
  a[i] = b[calculate(i)];
}

The forall looks like the following after resolution:

forall i in a.domain with (var chpl_src_auto_agg = newSrcAggregatorForArray(b)) {

  var aggMarker: bool;
  if aggMarker {   // will disappear later, see below
    a[i] = b[calculate(i)];
  }
  else {
    chpl_src_auto_agg.copy(a[i], b[calculate(i)]);
  }
}

After we do unordered forall optimization (during LICM) this turns into:

forall i in a.domain with (var chpl_src_auto_agg = newSrcAggregatorForArray(b)) {
  chpl_src_auto_agg.copy(a[i], b[calculate(i)]);
}

In more details this PR modifies operations that happen in 4 different passes:

normalize

  • After doing the pass over foralls for automatic local access, we go through
    last statements inside them in a separate loop.
  • A = as a last statement is an aggregation candidate if:
    • At least one side is a PRIM_MAYBE_LOCAL_THIS or
      PRIM_MAYBE_LOCAL_ARR_ELEM (a new primitive)
    • If there’s a side that isn’t, it “looks like” array access. (foo(bar))
  • We keep track of these candidates.

resolution

  • When we prefold PRIM_MAYBE_LOCAL_*, we check if it was part of an
    PRIM_MAYBE_AGGREGATE_ASSIGN.
    • If the other side is still PRIM_MAYBE_LOCAL_*, we don’t do anything as
      we’ll revisit this candidate when the other side is resolved. (we do this by
      checking/setting some arguments in PRIM_MAYBE_AGGREGATE_ASSIGN`
    • Else:
      • If we have confirmed this primitive to actually represent a local
        access, then, the parent =, is still an aggregation candidate. So we add
        a shadow variable to the forall, replace this candidate with:

        var aggMarker: bool;
        if aggMarker then
          // candidate (original assignment)
        else
          // call aggregator
        
        We'll revisit and finalize this optimization during LICM, where we also do the unordered forall optimization. We do it there, because we still need to do bunch of analysis to make sure that changing the execution order is safe. (This already exists on master)

        However, this poses a problem. We are still in resolution and will go
        through bunch of passes that can modify a potential = or an
        aggregation call. So, we need them both in the AST and we need them both
        going through the regular compilation process. Moreover, there are
        bigger structural changes that happen and that we want these calls to go
        through while they are in the AST.

        One alternative that I thought about was to add something similar to
        ContextCallExpr but haven’t done that. I am not sure if we can make
        that work through inlining. Maybe a potential AggregatedCallExpr can
        store two BlockStmts just like the conditional, so that we can store
        more than 1 expression for one of the alternatives.

      • Else, its parent = is not an aggregation candidate. Take no action.

callDestructors

  • When we normally do an analysis on last statements, we handle conditionals as
    sketched above separately. For the purposes of this analysis, if the last
    statement in a forall’s body is a conditional as above, we analyze the only
    call in its thenStmt as “the last statement”.

loopInvariantCodeMotion

  • When we normally do the transformation for unordered forall optimization, we
    also check whether this “last statement” is inside the then statement of one
    of those conditionals.

    • If that’s the case then it must be an aggregatable call. We replace the
      whole conditional with the else block (which at this point contains an
      already-inlined call to the aggregator’s copy)
    • Else, we make it an unordered call as we normally do. (no change there)
  • After doing the unordered forall optimization early in LICM, we also do a pass
    over CondStmts in the tree, and if we encounter any one that is added for
    aggregation, we revert that and do the necessary cleanup.

Implementation notes

  • Modify preNormalizeOptimizations to implement most of the optimization.

  • Modify optimizeForallUnorderedOps to handle special conditionals created
    for this optimization.

  • Modify lowerForalls to propagate aggregation-related flags to the cloned
    TPVs.

  • Add PRIM_MAYBE_LOCAL_ARR_ELEM to detect array element symbols yielded by
    array iterators as local.

  • Add PRIM_MAYBE_AGGREGATE_ASSIGN to encapsulate assignments that can be
    aggregated.

  • Modify preFold to handle these two new primitives.

  • Add dsiIteratorYieldsLocalElements to denote array types that yield elements
    that are local.

  • Add a class AggregationCandidateInfo to record candidate assignments’ info
    during normalization. This is used at resolution and lifetime checking.

  • Add FLAG_AGG_* flags to mark symbols added for aggregation such as the
    “aggregation marker”, aggregator generator calls etc.

  • Few more additions to ForallOptimizationInfo.

[Reviewed by @mppf and @ronawho]

Followups

Test status

  • [x] standard
  • [x] standard --auto-aggregation
  • [x] gasnet
  • [x] gasnet --auto-aggregation
  • [x] release/examples --auto-aggregation valgrind
  • [x] release/examples --auto-aggregation gasnet.asan
  • [x] release/examples --verify
  • [x] release/examples --verify --auto-aggregation

Modified Files:
A compiler/include/forallOptimizations.h
A compiler/optimizations/forallOptimizations.cpp
A modules/internal/ChapelAutoAggregation.chpl
A test/optimizations/autoAggregation/COMPOPTS
A test/optimizations/autoAggregation/SKIPIF
A test/optimizations/autoAggregation/basicDestAgg.chpl
A test/optimizations/autoAggregation/basicDestAgg.good
A test/optimizations/autoAggregation/basicSourceAgg.chpl
A test/optimizations/autoAggregation/basicSourceAgg.good
A test/optimizations/autoAggregation/bothLocal.chpl
A test/optimizations/autoAggregation/bothLocal.good
A test/optimizations/autoAggregation/childNotAggregatable.chpl
A test/optimizations/autoAggregation/childNotAggregatable.good
A test/optimizations/autoAggregation/insideGenericFunction.chpl
A test/optimizations/autoAggregation/insideGenericFunction.good
A test/optimizations/autoAggregation/localFlagDisablesAggregation.chpl
A test/optimizations/autoAggregation/localFlagDisablesAggregation.compopts
A test/optimizations/autoAggregation/localFlagDisablesAggregation.good
A test/optimizations/autoAggregation/localVariables.chpl
A test/optimizations/autoAggregation/localVariables.good
R compiler/include/preNormalizeOptimizations.h
R compiler/optimizations/preNormalizeOptimizations.cpp
R test/studies/bale/aggregation/AggregationPrimitives.chpl
R test/studies/bale/aggregation/CopyAggregation.chpl
M compiler/AST/ForallStmt.cpp
M compiler/AST/primitive.cpp
M compiler/include/ForallStmt.h
M compiler/include/driver.h
M compiler/include/flags_list.h
M compiler/include/primitive_list.h
M compiler/main/driver.cpp
M compiler/optimizations/Makefile.share
M compiler/optimizations/optimizeForallUnorderedOps.cpp
M compiler/passes/normalize.cpp
M compiler/resolution/lowerForalls.cpp
M compiler/resolution/preFold.cpp
M man/chpl.rst
M modules/dists/BlockCycDist.chpl
M modules/dists/BlockDist.chpl
M modules/dists/CyclicDist.chpl
M modules/dists/HashedDist.chpl
M modules/dists/SparseBlockDist.chpl
M modules/internal/ChapelArray.chpl
M modules/internal/ChapelDistribution.chpl
M modules/internal/ChapelStandard.chpl
M test/compflags/bradc/help/userhelp.good
M test/compflags/ferguson/print-module-resolution.good
M test/modules/sungeun/init/printModuleInitOrder.good
M test/modules/sungeun/init/printModuleInitOrder.na-none.good
M test/optimizations/deadCodeElimination/elliot/countDeadModules.comm-ofi.good
M test/optimizations/deadCodeElimination/elliot/countDeadModules.comm-ugni.good
M test/optimizations/forall-unordered-opt/compilation-tests/COMPOPTS
M test/studies/bale/aggregation/AtomicAggregation.chpl
M test/studies/bale/indexgather/ig-forall-opt.compopts
M test/studies/bale/indexgather/ig.chpl
M test/studies/bale/indexgather/ig.compopts
M util/chpl-completion.bash

Compare: https://github.com/chapel-lang/chapel/compare/0be475754efa...a7d436020f13