Branch: refs/heads/master
Revision: d70a5d1
Author: gbtitus
Log Message:
Merge pull request #17630 from gbtitus/add-msg-ord-fence-merged
In comm=ofi, add message-order-fence MCM mode and heavily restructure.
(Walked through with and reviewed by @ronawho.)
(Thanks to @mstrout for the idea of model-specific operation functions
to reduce control-flow clutter.)
Until now, the ofi comm layer has had two ways of achieving MCM
conformance, particularly with regard to the memory visibility of values
stored by PUTs and AMOs. One method uses delivery-complete completion
semantics which guarantee such visibility once the completion event is
delivered. This mode is slow because we have to wait for an entire
network round trip on every operation. The other mode uses default
completions, but adds required message ordering capabilities for RMA
read-after-write and others, and also does synthetic visibility-forcing
operations such as dummy GETs to achieve things like ensuring all
outstanding PUTs are memory-visible before we do an on-stmt or start
a parallel task. These two modes for conforming to the Chapel MCM were
called "delivery-complete" and "message-order", respectively. The code
had a simple boolean flag that told which mode it was operating in.
Here, we add a third mode called "message-order-fence". This is like
the message-order mode in that it uses the default completion level
along with message ordering settings, but it requires the atomic
capability and only asks the provider to order atomic operations that
write. This fulfills the MCM ordering requirements for AMOs directly.
It also requires the fenced-operation capability, and uses fenced ops to
enforce the ordering and visibility requirements in other circumstances.
This new mode is enabled only when the tasking layer uses a fixed
number of worker threads and the provider can supply enough transmit
contexts (regular endpoints, or actual contexts with a scalable transmit
endpoint) that every worker thread can have its own dedicated transmit
context. Because libfabric can only guarantee order for operations
between a given pair of endpoints, this fixed binding between endpoints
and tasks allows delaying operations or kinds of operations (fenced, for
example) until such time as the ordering or visibility effects those ops
provide are actually needed, thus improving performance. To put it the
other way around, message ordering guarantees are of limited usefulness
without dedicated transmit endpoints, because if a task initiates an
operation on a dynamically acquired transmit context it must ensure that
operation is complete and visible before releasing the transmit context
anyway.
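The payoff of binding a transmit context to one worker thread can be
sketched with a toy model (all identifiers here are illustrative, not
the actual comm-ofi code): ordering state lives in the context itself,
so a visibility-forcing (e.g. fenced) operation can be deferred until
some operation actually needs it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of lazy fencing on a dedicated transmit context. */
struct txCtx {
  bool havePendingWrites;   /* unordered PUTs/AMOs outstanding? */
  int  fencesIssued;        /* count fences, to show the deferral */
};

static void txPut(struct txCtx* tcx) {
  /* Default-completion PUT: fast, but leaves memory visibility pending. */
  tcx->havePendingWrites = true;
}

static void txForceVisibility(struct txCtx* tcx) {
  /* Only pay for a visibility-forcing op if something is outstanding. */
  if (tcx->havePendingWrites) {
    tcx->fencesIssued++;           /* e.g. one op initiated with a fence */
    tcx->havePendingWrites = false;
  }
}
```

Three back-to-back PUTs followed by one visibility point cost a single
fence; without the dedicated context, each PUT would have had to be
forced complete before its context could be released.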
The code now has an enumerated type for the operating modes and a global
variable that tells which mode it's operating in.
Provider selection and the actual per-operation code are both affected
significantly by this.
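The mode representation might look something like the following sketch
(the identifiers are guesses for illustration, not lifted from
comm-ofi.c):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enumeration of the three MCM conformance modes. */
typedef enum {
  mcmm_dlvrCmplt,     /* delivery-complete */
  mcmm_msgOrd,        /* message-order */
  mcmm_msgOrdFence,   /* message-order-fence (new) */
} mcmMode_t;

static mcmMode_t mcmMode = mcmm_dlvrCmplt;  /* global operating mode */

/* Handy for the verbose-mode (-v) output line mentioned below. */
static const char* mcmModeName(mcmMode_t m) {
  switch (m) {
    case mcmm_dlvrCmplt:   return "delivery-complete";
    case mcmm_msgOrd:      return "message-order";
    case mcmm_msgOrdFence: return "message-order-fence";
  }
  return "?";
}
```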
First, provider selection has been heavily restructured. We used to
just have two alternatives, and we looked for a provider that could do
one and then the other. Now we have three and want to be extensible to
more, so there's a list of functions to call that will find a provider
(if one is available) that can do a given mode, starting from the one we
expect to perform best (the new message-order-fence) to the one we
expect to be worst (delivery-complete). As before with the two-mode
code, we still first go through the modes looking for a "good" provider,
and only if we fail to find one do we go through them again accepting a
"not as good" one. ("Good" here means "not tcp or sockets", the same as
it did in the old code.) But since the comm layer has to work properly
whether it finds a provider on its own or has one forced upon it via the
environment, for each mode there is a single function that encapsulates
the corresponding capabilities and other requirements, and which is used
both when finding all providers that can support that mode and when
deciding whether a given forced provider can support it.
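The shape of that selection loop can be sketched as follows; the
predicates here are mocks standing in for the real per-mode capability
checks, and all names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One find-function per MCM mode; on the first pass only a "good"
 * provider is acceptable, on the second pass any provider is. */
typedef bool (*findProvFn_t)(bool acceptNotGood, int* modeOut);

/* Mocks: no fence-capable provider at all; message-order and
 * delivery-complete only via a "not as good" (e.g. tcp) provider. */
static bool findMsgOrdFenceProv(bool any, int* m) { (void)any; (void)m; return false; }
static bool findMsgOrdProv(bool any, int* m)      { if (any) { *m = 1; return true; } return false; }
static bool findDlvrCmpltProv(bool any, int* m)   { if (any) { *m = 2; return true; } return false; }

static findProvFn_t findProvFns[] = {
  findMsgOrdFenceProv,   /* expected to perform best */
  findMsgOrdProv,
  findDlvrCmpltProv,     /* expected to perform worst */
};

static int selectProvider(void) {
  for (int pass = 0; pass < 2; pass++) {            /* good, then not-as-good */
    for (size_t i = 0; i < sizeof(findProvFns)/sizeof(findProvFns[0]); i++) {
      int mode;
      if (findProvFns[i](pass == 1, &mode))
        return mode;
    }
  }
  return -1;  /* no usable provider */
}
```

With these mocks the first pass finds nothing, and the second pass
settles on the best mode any provider can support, message-order.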
Because the new message-order-fence MCM mode has a requirement for a
fixed number of worker threads and a dedicated transmit context for each
thread, the logic associated with deciding whether those requirements
hold has been moved from the endpoint setup code where it used to be,
into code called much earlier during provider selection. Endpoint setup
now just uses the information developed during those earlier calls.
There are also a couple of endpoint-related changes made as part of this
portion of the work. We've reduced from two receive endpoints per
locale/node, one for send/receive (AM) traffic and one for RMA, down to
just one for all traffic. The original code separated these in the
expectation of a load-distribution benefit, but we never observed one.
In addition, we've simplified the logic having to
do with whether or not we're using a single scalable transmit endpoint
with multiple transmit contexts, or just multiple transmit endpoints.
Finally, the line of output produced by the comm layer in "verbose" mode
(-v) now includes the MCM conformance mode in addition to the provider.
The largest part of the restructuring is that the bottom-level operation
functions, that is, the ones that call libfabric to do the actual work
and wait for completion if necessary, now do so through
mode-specific implementation functions. The old code had a fair amount
of mode-related control flow to initiate operations and/or wait for
completions, doing things one way for one mode and another way for the
other. With three modes now and maybe more in the future this seemed
unworkable, so now for each kind of operation (AM, PUT, GET, and AMO)
there are mode-specific functions to do the libfabric interactions for
that operation when the comm layer is operating in that mode. Thus for
example amReqFn_msgOrdFence() handles all the libfabric interactions for
AM requests when the comm layer is in message-order-fence mode. These
mode-specific implementation functions only need control flow related to
libfabric itself, things like how best to initiate the operation and
what kind of completion to ask for. They don't need MCM mode-related
control flow, because they're only called when we're in the specific MCM
mode they implement. Going forward, these will allow for mode-specific
development and maintenance to be done without having to worry about
breaking the comm layer's behavior with providers and networks that use
other modes.
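The dispatch pattern is essentially "bind once at init, branch never
per call"; a minimal sketch (the real functions such as
amReqFn_msgOrdFence() do libfabric work, while these stand-ins merely
identify themselves, and the mode numbering is assumed):

```c
#include <assert.h>
#include <string.h>

/* One implementation function per MCM mode for each operation kind;
 * shown here for AM requests only. */
typedef const char* (*amReqFn_t)(void);

static const char* amReqFn_dlvrCmplt(void)   { return "AM via delivery-complete"; }
static const char* amReqFn_msgOrd(void)      { return "AM via message-order"; }
static const char* amReqFn_msgOrdFence(void) { return "AM via message-order-fence"; }

static amReqFn_t amReqFn;  /* bound once, per the selected MCM mode */

static void initOps(int mcmMode) {
  switch (mcmMode) {
    case 0: amReqFn = amReqFn_dlvrCmplt;   break;
    case 1: amReqFn = amReqFn_msgOrd;      break;
    case 2: amReqFn = amReqFn_msgOrdFence; break;
  }
}
```

After initOps() runs, the hot path just calls amReqFn() with no
mode-related control flow at all.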
As part of this, the mcmReleaseNode() and forceMemFxVis*() functions
(the latter renamed from waitForPutsVis*() to say more accurately what
they do) have been updated to work with the new message-order-fence mode
as well.
Also as part of this portion of the work, we've moved the bitmap that
records nodes to which we've done PUTs whose memory effects might not
yet be visible from task-private data to the transmit context table, and
added a corresponding bitmap of nodes to which we've done AMOs. This
move was possible because, in either message-order-based mode, we don't
use these bitmaps unless transmit contexts are dedicated to worker
threads, so a context-private bitmap is by definition also private to
its task.
This removes a memory allocation and free from the task creation and
destruction path. Related to having two bitmaps of nodes on which the
memory effects of operations may not yet be visible, we added a
bitmap FOREACH macro that performs its loop body for each bit set in the
OR of two bitmaps. We also got rid of the node bitmap in the unbuffered
AMO support, since the information it held can now be put in the AMO
bitmap in the transmit context table. The unbuffered operation structs
thus don't have to be variable-length any more, simplifying the shared
(among PUT, GET, and AMO) code that creates them.
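A single-word toy version of that FOREACH-over-the-OR idea looks like
this (the real bitmaps span many words and the macro name is invented,
but the shape is the same):

```c
#include <assert.h>
#include <stdint.h>

/* Run the body once for each node whose bit is set in either bitmap,
 * e.g. the pending-PUT and pending-AMO node bitmaps. */
#define BITMAP_FOREACH_SET_OR(b1, b2, node)                         \
  for (int node = 0; node < 64; node++)                             \
    if ((((b1) | (b2)) >> node) & 1)
```

For example, with PUTs pending to nodes {1, 3} and AMOs pending to
nodes {1, 2}, the body runs exactly once each for nodes 1, 2, and 3.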
Finally, while here we made rather a laundry list of other changes.
Several have to do with managing registered memory. New mrLocalize*()
and mrUnlocalize*() functions take care of ensuring that operands which
must be accessible locally or remotely as operation sources or targets
are appropriately localized and copied in and out. In essence these
encapsulate bounce-buffering. We now record the memory region local
descriptor and remote key of the dummy variable that serves as the
source and sink of GETs and PUTs done only for memory visibility just
once, at the beginning, and distribute those around the job instead of
looking them up over and over. And we reworked the functions that
get memory region local descriptors and remote keys to return boolean
instead of plain int.
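The bounce-buffering those localize/unlocalize helpers encapsulate can
be sketched as below; the registered-memory predicate and the function
names are placeholders for the real mrLocalize*()/mrUnlocalize*() code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the real "is this address in registered memory?" check;
 * here we pretend nothing is registered, to force bounce-buffering. */
static bool addrIsRegistered(const void* p) { (void)p; return false; }

/* Localize a source operand: if its address can't be used directly,
 * copy it into a (conceptually registered) bounce buffer. */
static void* mrLocalizeSource(const void* addr, size_t size) {
  if (addrIsRegistered(addr))
    return (void*) addr;
  void* bounce = malloc(size);
  memcpy(bounce, addr, size);
  return bounce;
}

/* Unlocalize: release the bounce buffer if one was used.  (A target
 * operand would additionally be copied back out first.) */
static void mrUnlocalizeSource(void* local, const void* addr) {
  if (local != addr)
    free(local);
}
```

The operation code can then use the localized pointer unconditionally
and leave the "was a copy needed?" question entirely to these helpers.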
Lastly, we renamed the operands of AMOs, which used to be called various
forms of "operand 1" and "operand 2", as "operand" and "comparand" to
clarify what they contain. We removed CRC-32 support for AM requests,
since the last time it was used was over a year ago and the comm layer
seems mature enough that we should not be corrupting messages any more.
And, we removed the 'inline' keyword from static function declarators
that are not definitions. It's not meaningful for those.
Modified Files:
M runtime/include/comm/ofi/chpl-comm-task-decls.h
M runtime/src/comm/ofi/comm-ofi-internal.h
M runtime/src/comm/ofi/comm-ofi.c
Compare: https://github.com/chapel-lang/chapel/compare/ed6dad3e7e9f...d70a5d19c726