Branch: refs/heads/master
Revision: a7d4360
Author: e-kayrakli
Log Message:
Merge pull request #16965 from e-kayrakli/auto-agg
Automatically aggregate some communication in foralls
This PR makes the compiler analyze and transform the AST to use communication
aggregation within forall
s. At a high-level this is done by bringing 3 already
existing optimizations together:
-
Module-level aggregators: @ronawho has been working on them for some time,
and they are used by Arkouda, and within thetest/studies/bale
on Chapel
repo. Standalone, they can be used as TPVs in foralls, and aggregated
assignments can be replaced bycopy
method calls on aggregators. -
Unordered forall optimizations: @mppf has implemented this optimization,
where the assignments in forall bodies can be replaced by unordered
assignments if they are the last statements. This results in non-blocking
communication for those operations. This doesn’t require any code
modifications. -
Automatic local access: This analyzes forall “signatures” and bodies to use
localAccess
instead ofthis
in some array accesses where the compiler can
determine that the array’s distribution and forall’s domain are aligned. User
doesn’t need to modify the code to benefit from this.
This PR analyzes the last statements of foralls and makes them use aggregation
if:
- they are assignments,
- one side is provably local, where the other is likely not,
- the assignment is safe to be unordered.
The optimization is off-by-default and can be turned on with
--auto-aggregation
. --report-auto-aggregation
enables some reporting during
compilation. -sverboseAggregation
enables the module-level output at execution
time.
How it works
At a very high level, if you have something like:
var dom = newBlockDom({1..10});
var a: [dom] int;
var b: [dom] int;
forall i in a.domain {
a[i] = b[calculate(i)];
}
The forall
looks like the following after resolution:
forall i in a.domain with (var chpl_src_auto_agg = newSrcAggregatorForArray(b)) {
var aggMarker: bool;
if aggMarker { // will disappear later, see below
a[i] = b[calculate(i)];
}
else {
chpl_src_auto_agg.copy(a[i], b[calculate(i)]);
}
}
After we do unordered forall optimization (during LICM) this turns into:
forall i in a.domain with (var chpl_src_auto_agg = newSrcAggregatorForArray(b)) {
chpl_src_auto_agg.copy(a[i], b[calculate(i)]);
}
In more details this PR modifies operations that happen in 4 different passes:
normalize
- After doing the pass over foralls for automatic local access, we go through
last statements inside them in a separate loop. - A
=
as a last statement is an aggregation candidate if:- At least one side is a
PRIM_MAYBE_LOCAL_THIS
or
PRIM_MAYBE_LOCAL_ARR_ELEM
(a new primitive) - If there’s a side that isn’t, it “looks like” array access. (
foo(bar)
)
- At least one side is a
- We keep track of these candidates.
resolution
- When we prefold
PRIM_MAYBE_LOCAL_*
, we check if it was part of an
PRIM_MAYBE_AGGREGATE_ASSIGN
.- If the other side is still
PRIM_MAYBE_LOCAL_*
, we don’t do anything as
we’ll revisit this candidate when the other side is resolved. (we do this by
checking/setting some arguments in PRIM_MAYBE_AGGREGATE_ASSIGN` - Else:
-
If we have confirmed this primitive to actually represent a local
access, then, the parent=
, is still an aggregation candidate. So we add
a shadow variable to the forall, replace this candidate with:var aggMarker: bool; if aggMarker then // candidate (original assignment) else // call aggregator
We'll revisit and finalize this optimization during LICM, where we also do the unordered forall optimization. We do it there, because we still need to do bunch of analysis to make sure that changing the execution order is safe. (This already exists on master)However, this poses a problem. We are still in resolution and will go
through bunch of passes that can modify a potential=
or an
aggregation call. So, we need them both in the AST and we need them both
going through the regular compilation process. Moreover, there are
bigger structural changes that happen and that we want these calls to go
through while they are in the AST.One alternative that I thought about was to add something similar to
ContextCallExpr
but haven’t done that. I am not sure if we can make
that work through inlining. Maybe a potentialAggregatedCallExpr
can
store twoBlockStmt
s just like the conditional, so that we can store
more than 1 expression for one of the alternatives. -
Else, its parent
=
is not an aggregation candidate. Take no action.
-
- If the other side is still
callDestructors
- When we normally do an analysis on last statements, we handle conditionals as
sketched above separately. For the purposes of this analysis, if the last
statement in a forall’s body is a conditional as above, we analyze the only
call in itsthenStmt
as “the last statement”.
loopInvariantCodeMotion
-
When we normally do the transformation for unordered forall optimization, we
also check whether this “last statement” is inside the then statement of one
of those conditionals.- If that’s the case then it must be an aggregatable call. We replace the
whole conditional with the else block (which at this point contains an
already-inlined call to the aggregator’scopy
) - Else, we make it an unordered call as we normally do. (no change there)
- If that’s the case then it must be an aggregatable call. We replace the
-
After doing the unordered forall optimization early in LICM, we also do a pass
overCondStmt
s in the tree, and if we encounter any one that is added for
aggregation, we revert that and do the necessary cleanup.
Implementation notes
-
Modify
preNormalizeOptimizations
to implement most of the optimization. -
Modify
optimizeForallUnorderedOps
to handle special conditionals created
for this optimization. -
Modify
lowerForalls
to propagate aggregation-related flags to the cloned
TPVs. -
Add
PRIM_MAYBE_LOCAL_ARR_ELEM
to detect array element symbols yielded by
array iterators as local. -
Add
PRIM_MAYBE_AGGREGATE_ASSIGN
to encapsulate assignments that can be
aggregated. -
Modify
preFold
to handle these two new primitives. -
Add
dsiIteratorYieldsLocalElements
to denote array types that yield elements
that are local. -
Add a class
AggregationCandidateInfo
to record candidate assignments’ info
during normalization. This is used at resolution and lifetime checking. -
Add
FLAG_AGG_*
flags to mark symbols added for aggregation such as the
“aggregation marker”, aggregator generator calls etc. -
Few more additions to
ForallOptimizationInfo
.
[Reviewed by @mppf and @ronawho]
Followups
-
We throw in the towel early in normalization for literal symbols. However,
they are local and can be treated as such for aggregation purposes. -
Extend automatic local access to cover indices yielded by aligned followers:
https://github.com/Cray/chapel-private/issues/1585 -
Can we write optimized foralls for promotion?
https://github.com/Cray/chapel-private/issues/1586
Test status
- [x] standard
- [x] standard --auto-aggregation
- [x] gasnet
- [x] gasnet --auto-aggregation
- [x] release/examples --auto-aggregation valgrind
- [x] release/examples --auto-aggregation gasnet.asan
- [x] release/examples --verify
- [x] release/examples --verify --auto-aggregation
Modified Files:
A compiler/include/forallOptimizations.h
A compiler/optimizations/forallOptimizations.cpp
A modules/internal/ChapelAutoAggregation.chpl
A test/optimizations/autoAggregation/COMPOPTS
A test/optimizations/autoAggregation/SKIPIF
A test/optimizations/autoAggregation/basicDestAgg.chpl
A test/optimizations/autoAggregation/basicDestAgg.good
A test/optimizations/autoAggregation/basicSourceAgg.chpl
A test/optimizations/autoAggregation/basicSourceAgg.good
A test/optimizations/autoAggregation/bothLocal.chpl
A test/optimizations/autoAggregation/bothLocal.good
A test/optimizations/autoAggregation/childNotAggregatable.chpl
A test/optimizations/autoAggregation/childNotAggregatable.good
A test/optimizations/autoAggregation/insideGenericFunction.chpl
A test/optimizations/autoAggregation/insideGenericFunction.good
A test/optimizations/autoAggregation/localFlagDisablesAggregation.chpl
A test/optimizations/autoAggregation/localFlagDisablesAggregation.compopts
A test/optimizations/autoAggregation/localFlagDisablesAggregation.good
A test/optimizations/autoAggregation/localVariables.chpl
A test/optimizations/autoAggregation/localVariables.good
R compiler/include/preNormalizeOptimizations.h
R compiler/optimizations/preNormalizeOptimizations.cpp
R test/studies/bale/aggregation/AggregationPrimitives.chpl
R test/studies/bale/aggregation/CopyAggregation.chpl
M compiler/AST/ForallStmt.cpp
M compiler/AST/primitive.cpp
M compiler/include/ForallStmt.h
M compiler/include/driver.h
M compiler/include/flags_list.h
M compiler/include/primitive_list.h
M compiler/main/driver.cpp
M compiler/optimizations/Makefile.share
M compiler/optimizations/optimizeForallUnorderedOps.cpp
M compiler/passes/normalize.cpp
M compiler/resolution/lowerForalls.cpp
M compiler/resolution/preFold.cpp
M man/chpl.rst
M modules/dists/BlockCycDist.chpl
M modules/dists/BlockDist.chpl
M modules/dists/CyclicDist.chpl
M modules/dists/HashedDist.chpl
M modules/dists/SparseBlockDist.chpl
M modules/internal/ChapelArray.chpl
M modules/internal/ChapelDistribution.chpl
M modules/internal/ChapelStandard.chpl
M test/compflags/bradc/help/userhelp.good
M test/compflags/ferguson/print-module-resolution.good
M test/modules/sungeun/init/printModuleInitOrder.good
M test/modules/sungeun/init/printModuleInitOrder.na-none.good
M test/optimizations/deadCodeElimination/elliot/countDeadModules.comm-ofi.good
M test/optimizations/deadCodeElimination/elliot/countDeadModules.comm-ugni.good
M test/optimizations/forall-unordered-opt/compilation-tests/COMPOPTS
M test/studies/bale/aggregation/AtomicAggregation.chpl
M test/studies/bale/indexgather/ig-forall-opt.compopts
M test/studies/bale/indexgather/ig.chpl
M test/studies/bale/indexgather/ig.compopts
M util/chpl-completion.bash
Compare: https://github.com/chapel-lang/chapel/compare/0be475754efa...a7d436020f13