New Issue: Using co-locales on single-node ofi runs may result in "Function not implemented" errors

28373, "bradcray", "Using co-locales on single-node ofi runs may result in "Function not implemented" errors", "2026-02-05T00:30:44Z"

Today, we're finding that single-node co-locale runs, such as -nl 1x4, can fail with errors of the form:

internal error: 0: comm-ofi.c:2209: OFI error: fi_domain(ofi_fabric, ofi_info, &ofi_domain, ((void*)0)): Function not implemented

whereas -nl 1, -nl 4, -nl 2x4, and -nl 4x4 all work fine. That is, the error seems specific to single-node co-locale runs.

That said, the behavior also seems to depend on the version of libfabric used. Specifically, we're seeing this error when using libfabric versions:

  • 1.22.0
  • 2.2.0rc1

But things work fine when using:

  • 1.20.1
  • 2.3.1
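
For anyone triaging their own setup, the observations above can be restated as a small lookup. This is only a sketch of what the report describes, not a diagnosis: the helper names (parse_nl, is_single_node_colocale, matches_reported_failure) are hypothetical, it assumes -nl values of the plain forms N and NxM only, it treats "co-locale" as more than one locale per node, and libfabric versions not listed above are treated as unknown rather than guessed.

```python
import re

# libfabric versions exactly as listed in this report; nothing else is known.
REPORTED_FAILING = {"1.22.0", "2.2.0rc1"}
REPORTED_WORKING = {"1.20.1", "2.3.1"}


def parse_nl(spec):
    """Split an -nl value such as "4" or "1x4" into (nodes, locales_per_node).

    Assumption: only the plain "N" and "NxM" forms are handled here.
    """
    m = re.fullmatch(r"(\d+)(?:x(\d+))?", spec)
    if m is None:
        raise ValueError(f"unrecognized -nl value: {spec!r}")
    nodes = int(m.group(1))
    per_node = int(m.group(2)) if m.group(2) is not None else 1
    return nodes, per_node


def is_single_node_colocale(spec):
    """True for shapes like 1x4: one node hosting more than one locale."""
    nodes, per_node = parse_nl(spec)
    return nodes == 1 and per_node > 1


def matches_reported_failure(spec, libfabric_version):
    """True only for the combination this report observed failing."""
    return is_single_node_colocale(spec) and libfabric_version in REPORTED_FAILING


if __name__ == "__main__":
    for spec in ["1", "4", "1x4", "2x4", "4x4"]:
        print(spec, matches_reported_failure(spec, "1.22.0"))
```

A version that appears in neither set simply returns False from matches_reported_failure; that means "not reported here", not "known to work".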