28363, "bradcray", "Emit friendlier error when Chapel using ofi exceeds number of available endpoints", "2026-02-02T18:49:51Z"
I'm finding that when executing a Chapel program with CHPL_COMM=ofi on a node architecture and locale mapping that tries to create more endpoints than the OFI provider can support, we get the error:
OFI error: fi_enable(tcip->txCtx): Invalid resource domain
For example, on HPE Cray EX systems, the number of endpoints seems to be capped at roughly 254 per process/NIC, so mapping locales to processors in a way that gives each locale more than ~254 cores triggers this error.
Two workarounds that can currently be used in such cases are to:
- Run multiple co-locales per node to divide the cores between more locales/processes and/or to potentially give different processes different NICs
- Set CHPL_RT_COMM_OFI_EP_CNT to a lower value
This issue requests that we develop a way to programmatically query or test this limit and to issue a friendlier error in such cases, for example:
Chapel attempted to create abc endpoints, but the limit is xyz
(ideally with a pointer to documentation describing the workarounds, as in chapel-lang/chapel Pull Request #28360, 'Add a "troubleshooting" section to Cray (EX) docs; beef up heap size docs')
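
As a rough sketch of the requested check (not existing runtime code), the comparison and message could look something like the following in C. The helper name `format_ep_limit_error` is hypothetical, and the sketch assumes the limit has already been obtained somehow; how to query it reliably from the provider is the open question in this issue.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the Chapel runtime.
 *
 * Compares the number of endpoints the runtime wants to create against a
 * limit that is assumed to have been queried elsewhere. If the request
 * exceeds the limit, writes a friendlier error message into buf and
 * returns 1; otherwise writes an empty string and returns 0.
 */
static int format_ep_limit_error(size_t requested, size_t limit,
                                 char *buf, size_t bufsize) {
  if (requested <= limit) {
    if (bufsize > 0) {
      buf[0] = '\0';  /* no error to report */
    }
    return 0;
  }

  /* Message wording follows the example in this issue, plus the two
     workarounds listed above. snprintf truncates safely if buf is small. */
  snprintf(buf, bufsize,
           "Chapel attempted to create %zu endpoints, but the limit is %zu; "
           "consider running multiple co-locales per node or setting "
           "CHPL_RT_COMM_OFI_EP_CNT to a lower value",
           requested, limit);
  return 1;
}
```

The idea would be to call something like this before enabling the transmit contexts, so that the user sees the counts and workarounds rather than the raw `fi_enable` failure.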