Branch: refs/heads/master
Revision: 3db8ffa
Author: gbtitus
Log Message:
Merge pull request #16515 from gbtitus/ofi-improve-mcm-conformance
Improve MCM conformance in comm=ofi.
(Reviewed by @ronawho.)
This is a collection of changes that improve conformance to the Chapel
memory consistency model (MCM) in the libfabric-based comm layer. The
major changes are to favor delivery-complete completion semantics over
message ordering settings as our solution for MCM conformance, and to
improve how we ensure that non-fetching AMOs done by means of
non-blocking AMs are complete as required by the MCM.
The original “solution” to the problem of ensuring AMO completion was
added in PR #15712 and involved using non-blocking AMs, tracking the
number of such outstanding AMs and their target nodes, and following up
when needed with blocking no-op AMs to ensure the non-blocking ones were
done. Unfortunately, this was both heavyweight, because it involved
allocating and maintaining a numNodes-sized bitmap, and broken, because
it only forced order among AMOs targeting the same node, not all nodes
as the MCM requires. So here we remove it and instead use blocking AMs,
so that we know when the target memory has been updated, but we defer
waiting for the resulting ‘done’ indicators until we do the next thing
with MCM implications.
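
As a rough illustration of the deferred-wait idea (not the actual
comm-ofi.c code), the sketch below simulates it in plain C. Every name
here (hyp_task_state, hyp_amo_add_nf, hyp_wait_for_dones, hyp_mcm_get)
is hypothetical, and the "network" is just an in-memory queue that only
makes progress when the task waits.

```c
#include <assert.h>

#define HYP_MAX_INFLIGHT 16

/* One simulated AM carrying a non-fetching atomic add. */
typedef struct {
  long *target;   /* memory on the (simulated) target node */
  long operand;   /* value to add */
} hyp_am;

/* Hypothetical per-task state: AMs sent whose 'done' is not yet seen. */
typedef struct {
  hyp_am queue[HYP_MAX_INFLIGHT];
  int head, tail;    /* queue[head..tail) have not executed yet */
  int num_pending;   /* 'done' indicators still outstanding */
} hyp_task_state;

/* Issue the AMO and return right away; do not wait for 'done'. */
static void hyp_amo_add_nf(hyp_task_state *t, long *target, long operand) {
  assert(t->tail < HYP_MAX_INFLIGHT);
  t->queue[t->tail].target = target;
  t->queue[t->tail].operand = operand;
  t->tail++;
  t->num_pending++;
}

/* Simulated progress: run one queued AM; its 'done' then arrives. */
static void hyp_progress(hyp_task_state *t) {
  if (t->head < t->tail) {
    hyp_am *am = &t->queue[t->head++];
    *am->target += am->operand;
    t->num_pending--;
  }
}

/* Fence: wait for all outstanding 'done' indicators, for all nodes. */
static void hyp_wait_for_dones(hyp_task_state *t) {
  while (t->num_pending > 0)
    hyp_progress(t);
}

/* Any operation with MCM implications fences first. */
static long hyp_mcm_get(hyp_task_state *t, const long *src) {
  hyp_wait_for_dones(t);
  return *src;
}
```

Note that the fence covers every outstanding AMO regardless of target,
which is the property the old per-node bitmap approach lacked.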
Relatedly, this also adds chpl_comm_task_create() as a sibling of the
task fence function chpl_comm_task_end(), and calls this new function
from _upEndCount() before we create a child task or a group of them. We
only need the parent to call this because its purpose is to fence the
parent’s “recent” remote operations (including non-fetching AMOs) so
that their results are definitely visible to children created later.
Because for now only comm=ofi needs this function, it’s declared by
means of the usual mechanism for optional runtime functions and only ofi
defines it.
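
The general shape of an optional hook plus a parent-side fence might
look like the sketch below. This is an assumption-laden illustration,
not the real runtime headers: the HYP_ macro, the hyp_ functions, and
the hyp_end_count type are all made up, and the C function
hyp_up_end_count merely stands in for the Chapel-side _upEndCount().

```c
#include <assert.h>

/* --- what one comm layer's impl header might provide (hypothetical) --- */
static int hyp_fence_calls = 0;   /* counts fences, for illustration only */

static void hyp_ofi_task_create(void) {
  /* A real version would wait here for the parent's outstanding
   * remote operations; this stand-in only counts the call. */
  hyp_fence_calls++;
}
#define HYP_COMM_IMPL_TASK_CREATE() hyp_ofi_task_create()

/* --- what the generic comm header might do (hypothetical) --- */
#ifdef HYP_COMM_IMPL_TASK_CREATE
#define hyp_comm_task_create() HYP_COMM_IMPL_TASK_CREATE()
#else
#define hyp_comm_task_create() ((void) 0)   /* layers without it: no-op */
#endif

/* --- stand-in for the end-count bump done before spawning tasks --- */
typedef struct {
  int count;   /* number of child tasks the parent will wait for */
} hyp_end_count;

static void hyp_up_end_count(hyp_end_count *e, int num_tasks) {
  /* The parent fences once, before any child exists, so its earlier
   * remote operations are visible to every child created afterward. */
  hyp_comm_task_create();
  e->count += num_tasks;
}
```

With this shape a group of child tasks costs the parent one fence, not
one per child, and comm layers that do not define the hook pay nothing.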
While here I also made some performance-related changes. We now use the
“domain” threading model rather than the default “safe” one, which
reduces lock usage inside libfabric and its providers. We also now
inject sending operations when possible instead of initiating them
normally, which speeds things up slightly.
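
For context, libfabric exposes the threading model through
domain_attr->threading (FI_THREAD_DOMAIN versus FI_THREAD_SAFE) and
reports the largest injectable payload in tx_attr->inject_size; an
injected operation generates no completion event. The sketch below
shows one plausible "inject when possible" test. The conditions and
the names hyp_send_path and hyp_choose_send_path are assumptions for
illustration, not what comm-ofi.c necessarily does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { HYP_SEND_INJECT, HYP_SEND_NORMAL } hyp_send_path;

/* Decide whether a send may be injected.
 *   len             payload size in bytes
 *   inject_size     provider limit (as from tx_attr->inject_size)
 *   need_completion caller must see a completion for this operation
 * Injection is only chosen when the payload fits and no completion
 * event is required, since injected operations do not produce one. */
static hyp_send_path hyp_choose_send_path(size_t len, size_t inject_size,
                                          bool need_completion) {
  if (!need_completion && len <= inject_size)
    return HYP_SEND_INJECT;
  return HYP_SEND_NORMAL;
}
```

A caller would then use an inject-style call for HYP_SEND_INJECT and an
ordinary send with a completion context otherwise.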
Taken together, these changes improve performance in some tests and
reduce it in others, though overall the effect seems positive. (Some of
the regressions seem to be an unfortunate side effect of fixing the bug
related to non-fetching AMO completion.) The new create-task fence
function may also allow us to improve performance for regular PUTs in
the future.
This resolves Cray/chapel-private#1307.
Modified Files:
M modules/internal/ChapelBase.chpl
M runtime/include/chpl-comm.h
M runtime/include/comm/ofi/chpl-comm-impl.h
M runtime/include/comm/ofi/chpl-comm-task-decls.h
M runtime/src/comm/ofi/comm-ofi.c
Compare: https://github.com/chapel-lang/chapel/compare/8ee7f36da4cf...3db8ffa80d6a