Compiling 1.23.0 with OmniPath

Hello,

I am trying to compile chapel-1.23.0 on an HPC cluster with OmniPath with the following settings:

module load gcc/9.3.0 openmpi/4.0.3
export CHPL_COMM=ofi
export CHPL_LAUNCHER=mpirun4ofi
export CHPL_TARGET_CPU=native

This breaks for me inside the bundled libfabric:

In file included from /tmp/razoumov/chapel-1.23.0/third-party/libfabric/libfabric-src/prov/efa/src/efa_device.c:50:
/tmp/razoumov/chapel-1.23.0/third-party/libfabric/libfabric-src/prov/efa/src/efa.h: In function ‘efa_ep_support_rdma_read’:
/tmp/razoumov/chapel-1.23.0/third-party/libfabric/libfabric-src/prov/efa/src/efa.h:376:44: error: ‘EFADV_DEVICE_ATTR_CAPS_RDMA_READ’ undeclared (first use in this function)
376 | return efa_ep->domain->ctx->device_caps & EFADV_DEVICE_ATTR_CAPS_RDMA_READ;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tmp/razoumov/chapel-1.23.0/third-party/libfabric/libfabric-src/prov/efa/src/efa.h:376:44: note: each undeclared identifier is reported only once for each function it appears in
make[6]: *** [Makefile:13246: prov/efa/src/src_libfabric_la-efa_device.lo] Error 1
make[5]: *** [Makefile:4066: all] Error 2
make[4]: *** [Makefile:66: build-libfabric] Error 2
make[3]: *** [Makefile:125: /tmp/razoumov/chapel-1.23.0/third-party/libfabric/install/linux64-x86_64-native-gnu-none] Error 2
make[2]: *** [Makefile:79: third-party-pkgs] Error 2
make[1]: *** [Makefile:94: runtime] Error 2
make: *** [Makefile:65: comprt] Error 2

I can’t find EFADV_DEVICE_ATTR_CAPS_RDMA_READ inside the system installed /longer/path/usr/lib64/librdmacm.so* library so I don’t know which library should provide it.

I also tried disabling compilation of the bundled libfabric by setting LIBFABRIC_DIR to the system’s libfabric 1.10.1. I can see:

$ ls $LIBFABRIC_DIR/lib/
./ ../ libfabric.a libfabric.la* libfabric.so@ libfabric.so.1@ libfabric.so.1.13.1* pkgconfig/
$ ls $LIBFABRIC_DIR/include/rdma/
./ ../ fabric.h fi_atomic.h fi_cm.h fi_collective.h fi_domain.h fi_endpoint.h fi_eq.h fi_errno.h fi_rma.h fi_tagged.h fi_trigger.h

Then I do make clean followed by make inside $CHPL_HOME. I am getting the same error (undeclared EFADV_DEVICE_ATTR_CAPS_RDMA_READ) as before, so it seems having a valid LIBFABRIC_DIR pointing to the system’s libfabric/1.10.1 does not stop compilation of the bundled libfabric.

Any suggestions?

Thank you,

Alex.

Hello Alex –

To use the system libfabric instead of the bundled one, it should work to set the environment variable CHPL_LIBFABRIC=system and then either set LIBFABRIC_DIR to the system libfabric directory, as you did, or, if you’re on a system with pkg-config support, set PKG_CONFIG_PATH so that pkg-config can find the system libfabric.
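A minimal sketch of those settings follows. The path /path/to/libfabric is a placeholder, not a real location, and the pkg-config variant assumes the libfabric install keeps its .pc file under lib/pkgconfig, as the directory listing earlier in the thread suggests.

```shell
# Sketch only: /path/to/libfabric is a placeholder for the system libfabric prefix.
export CHPL_COMM=ofi
export CHPL_LIBFABRIC=system      # use the system libfabric, not the bundled one

# Option 1: point Chapel at the install prefix (must contain lib/ and include/rdma/).
export LIBFABRIC_DIR=/path/to/libfabric

# Option 2 (assumption: libfabric.pc lives in lib/pkgconfig under that prefix):
# export PKG_CONFIG_PATH="/path/to/libfabric/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"

# Then rebuild the runtime from a clean tree:
#   cd "$CHPL_HOME" && make clean && make
```

Only one of the two options should be needed; the commented-out line is there to show the shape of the pkg-config alternative.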

I’ll take a look at our documentation and see if/where that needs improving.

greg

Hi Greg,

Setting CHPL_LIBFABRIC=system did the trick. It’s not mentioned anywhere in the 1.23 documentation. Although it prints

Warning: $CHPL_HOME/chplconfig:line 2: "CHPL_LIBFABRIC" is not an acceptable variable

multiple times during compilation, it compiles successfully.

Now I am having trouble running newly recompiled multi-locale executables. They work with correct output, but at the end I see segmentation faults. I tried the tcp libfabric provider; other providers do not seem to be enabled. I will figure it out.

Thank you,

Alex.

Hello Alex,

A member of the Chapel team provided a custom script in order to make it possible to run on the OmniPath network.

As it is big, I’ll not post it here. I can share it with you if you want.

Best regards,

Tiago Carneiro

Note that I think @tcarneiro and @razoumov are trying to do slightly different things: Alex is using our new CHPL_COMM=ofi (libfabric) option to target OmniPath, whereas Tiago has traditionally used CHPL_COMM=gasnet and its ofi conduit. I’m not sure we have any experience within the team (or community) directly comparing these two approaches to see which is the better way to target OmniPath networks with Chapel today.

-Brad


Hello Alex,

you can find the script I use at https://github.com/chapel-lang/chapel/issues/12990#issuecomment-564320965

Updating it to 1.23 works for me.

Best regards,

Tiago Carneiro

Thank you, Tiago – your approach and settings with CHPL_COMM=gasnet worked on our OmniPath cluster, and all tests seem to be running fine.

With CHPL_COMM=ofi:

module load libfabric/1.10.1
export CHPL_COMM=ofi
export CHPL_LIBFABRIC=system
export LIBFABRIC_DIR=$EBROOTLIBFABRIC   # inside $LIBFABRIC_DIR we have lib/libfabric* and include/rdma
export CHPL_LAUNCHER=mpirun4ofi
export CHPL_TARGET_CPU=native
export CHPL_RT_COMM_OFI_PROVIDER=psm2       # official libfabric provider on our cluster
export OMPI_MCA_mtl=ofi
export FI_PROVIDER=psm2

Chapel compiled successfully, and chpl could compile multi-locale codes, but I was then consistently getting the error No libfabric provider for prov_name "psm2" when trying to run the compiled codes.

I will use CHPL_COMM=gasnet for now.

Thank you everyone!

Alex.

Hello Alex, Tiago –

Ah, this is good, both the success and the information. I’ll assume that the system libfabric library does in fact include the psm2 provider, since that’s the expected provider for the network on that cluster. That being so, the No libfabric provider for prov_name "psm2" message is probably indicating that one or more of the hint settings we supply to the provider selection call is in conflict with psm2’s requirements or capabilities. I’ve logged a Chapel issue noting this.
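One way to test the assumption that the system libfabric includes psm2 is to ask libfabric's own fi_info utility, outside of Chapel. The sketch below wraps that check in a small helper; check_provider and the FI_INFO override are hypothetical names introduced here, and the helper assumes fi_info exits non-zero when no matching provider is found.

```shell
# Sketch only: check_provider and FI_INFO are illustrative names, not part of Chapel.
# Assumption: fi_info (shipped with libfabric) exits non-zero if nothing matches.
check_provider() {
    prov="$1"
    fi_info_cmd="${FI_INFO:-fi_info}"   # allow pointing at a specific fi_info
    if ! command -v "$fi_info_cmd" >/dev/null 2>&1; then
        echo "fi_info not found; load the libfabric module first" >&2
        return 2
    fi
    # -p restricts the query to the named provider
    if "$fi_info_cmd" -p "$prov" >/dev/null 2>&1; then
        echo "provider '$prov' is available"
    else
        echo "provider '$prov' is NOT available"
        return 1
    fi
}

# Usage on a compute node, after `module load libfabric/1.10.1`:
#   check_provider psm2
```

If the provider is reported as available here but Chapel still reports no provider, that would point toward the hint settings rather than the library build. Running `fi_info -l` lists every provider the library was built with.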

thanks,
greg
