20624, "jhh67", "task-overload.chpl hangs or gets an error on EX", "2022-09-02T22:38:04Z"
On an EX, test/runtime/configMatters/comm/task-overload.chpl sometimes either doesn't complete or gets an error:
> internal error: 0: comm-ofi.c:5045: OFI error: fi_recvmsg(ofi_rxEp, &ofi_msg_reqs[ofi_msg_i], (1ULL << 16)): Resource temporarily unavailable
The above error corresponds to FI_EAGAIN.
This happens about 10% of the time with main, and pretty much 100% of the time with PR #19778 (Improve non-blocking AM performance in MCM mode message-order-fence, chapel-lang/chapel). Changing the call to fi_recvmsg to loop when FI_EAGAIN is returned doesn't seem to fix the problem, although that change should probably be made anyway. Setting FI_LOG_LEVEL=warn produces the following errors:
libfabric:70012:1662152006::cxi:ep_data:cxip_recv_pte_cb():2706<warn> nid000052: RXC (0x891:130:0) PtlTE 143: Flow control EQ full
libfabric:70012:1662152008::cxi:ep_data:cxip_ux_onload_complete():2303<warn> nid000052: RXC (0x891:130:0) PtlTE 143: Software UX list updated, 0 SW UX entries
libfabric:70012:1662152008::cxi:ep_data:cxip_recv_reenable():1927<warn> nid000052: RXC (0x891:130:0) PtlTE 143: Re-enabling PTE drop_count 1853 20
libfabric:70012:1662152008::cxi:ep_data:cxip_post_ux_onload_fc():2260<warn> nid000052: RXC (0x891:130:0) PtlTE 143: Now in RXC_FLOW_CONTROL
libfabric:70012:1662152008::cxi:ep_data:cxip_recv_pte_cb():2614<warn> nid000052: RXC (0x891:130:0) PtlTE 143: Now in RXC_ENABLED
libfabric:70012:1662152011::cxi:ep_data:cxip_recv_pte_cb():2706<warn> nid000052: RXC (0x891:130:0) PtlTE 143: Flow control EQ full
It appears that the receiver is being overrun, but why this causes it to not make progress isn't clear. Reducing the number of threads on the sender allows the test to complete, but it may just be that the sender doesn't overrun the receiver before the test completes. It's also not clear why it hangs sometimes without getting an error.
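
For reference, the "loop when FI_EAGAIN is returned" change mentioned above could look roughly like the sketch below. This is not the code in comm-ofi.c. To keep it self-contained it does not link against libfabric: mock_recvmsg and mock_progress are hypothetical stand-ins for fi_recvmsg and for whatever drives progress between attempts (for example, polling the completion queue), and SKETCH_EAGAIN stands in for FI_EAGAIN, which libfabric defines as EAGAIN. The retry bound is also an assumption. As noted above, retrying alone did not make the hang go away.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Stand-in for libfabric's FI_EAGAIN, which is defined as EAGAIN. */
#define SKETCH_EAGAIN EAGAIN

/* Hypothetical mock state: how many more times the mock receive reports
   "try again", and how many times progress was driven. */
static int mock_busy_count = 0;
static int mock_progress_calls = 0;

/* Hypothetical stand-in for fi_recvmsg(): returns -EAGAIN while busy,
   0 once the receive could be posted. */
static ssize_t mock_recvmsg(void) {
  if (mock_busy_count > 0) {
    mock_busy_count--;
    return -SKETCH_EAGAIN;
  }
  return 0;
}

/* Hypothetical stand-in for driving progress between attempts. */
static void mock_progress(void) {
  mock_progress_calls++;
}

/* Post the receive, retrying while it reports EAGAIN and driving progress
   before each retry. Gives up after max_retries retries and returns the
   last result so the caller can still report a persistent failure. */
static ssize_t post_recv_with_retry(int max_retries) {
  assert(max_retries >= 0);
  ssize_t rc = mock_recvmsg();
  for (int i = 0; rc == -SKETCH_EAGAIN && i < max_retries; i++) {
    mock_progress();
    rc = mock_recvmsg();
  }
  return rc;
}
```

Bounding the retries keeps a permanently stalled receiver from turning the internal error into a silent spin; whether a bound is appropriate in the real runtime is an open question.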