Chapel not working for OmniPath slurm

My compiled Chapel multi-locale configuration isn't working with the Intel Omni-Path interconnect. The cluster uses a Slurm queue. When I submit the following job script, a second job gets spawned on another node, and then the run gets stuck.
Compilation command:
$CHPL_HOME/bin/linux64-x86_64/chpl test-locales.chpl --fast -o test-locales
Job script:

#!/bin/bash

#SBATCH -t 0:10:0
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --partition=p.test
#SBATCH --output=output.chapel

export CHPL_HOME=/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ofi

export FI_PROVIDER=psm2
export CHPL_LAUNCHER=slurm-gasnetrun_ofi
export GASNET_OFI_SPAWNER=mpi

export HFI_NO_CPUAFFINITY=1
export CHPL_LLVM=bundled
export CHPL_TARGET_CPU=skylake
export GASNET_BACKTRACE=1

# $CHPL_HOME/bin/linux64-x86_64/chpl test-locales.chpl --fast -o test-locales

# Set the Chapel program and dynamic number of locales

export PROG="./test-locales"

export ARGS="-nl $SLURM_NNODES" # Dynamically set the number of locales

# Run the Chapel program (the Chapel launcher starts the processes; srun is not called here)

echo "Running Chapel program with $SLURM_NNODES locales..."

$PROG $ARGS

Please help.

Hi @dutta-alankar —

Welcome to Chapel Discourse! (though I'm sorry that it's with a problem).

Can you share the output of your run if you add a --verbose flag to the $ARGS variable? This should show the commands that the Chapel program is executing on your behalf to try and get things up and running, and may help us understand where things are going wrong.

-Brad
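As a minimal sketch of the change requested above, the argument string from the job script can be extended with --verbose. The helper name build_args is hypothetical and only illustrates the idea; falling back to 1 locale outside a Slurm job is likewise an assumption.

```shell
# build_args [NNODES]
# Print the launcher arguments with --verbose appended.
# Hypothetical helper, not part of Chapel or Slurm.
build_args() {
  nnodes="${1:-1}"   # number of locales; defaults to 1 if not given
  echo "-nl ${nnodes} --verbose"
}

# Inside a Slurm job SLURM_NNODES is set; fall back to 1 elsewhere (assumption).
ARGS="$(build_args "${SLURM_NNODES:-1}")"
echo "Would run: ./test-locales ${ARGS}"
```

In the job script itself this amounts to appending --verbose to the existing ARGS line.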

@bradcray Thanks!
For context, here are the details of the system where I was trying to set up Chapel: https://docs.mpcdf.mpg.de/doc/computing/clusters/systems/Astrophysics/MPA-FREYA.html
Here is the output produced when I run the code with --verbose:

Running Chapel program with 1 locales...
salloc --quiet -J CHPL-test-local -N 1 --ntasks=1 --exclusive --partition=p.test  /freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin//gasnetrun_ofi -n 1 -N 1 -c 0 -E 'SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SLURM_NODEID,SLURM_TASK_PID,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,SLURM_PRIO_PROCESS,XKEYSYMDB,OMPI_MCA_btl_openib_allow_ib,GASNET_BACKTRACE,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HDF5_HOME,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,HDF5_ROOT,SLURM_DISTRIBUTION,ENVIRONMENT,PROG,GPG_TTY,LESS_ADVANCED_PREPROCESSOR,OPENMPI_HOME,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,ROCR_VISIBLE_DEVICES,SLURM_PROCID,SLURM_JOB_GID,CHPL_LAUNCHER,MACHTYPE,SLURMD_NODENAME,JOB_TMPDIR,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,SLURM_GPUS_ON_NODE,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,SLURM_NTASKS,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_TOPOLOGY_ADDR,SLURM_THREADS_PER_CORE,_CE_CONDA,LIBGL_DEBUG,SLURM_WORKING_CLUSTER,__MODULES_LMALTNAME,SLURM_JOB_NAME,GCC_HOME,PROFILEREAD,TMPDIR,LIBRARY_PATH,SLURM_JOB_GPUS,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SLURM_TOPOLOGY_ADDR_PATTERN,SSH_TTY,OMPI_MCA_pml,FROM_HEADER,MAIL,SLURM_CPUS_ON_NODE,SLURM_JOB_NUM_NODES,SLURM_MEM_PER_NODE,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,SLURM_JOB_UID,ARGS,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,SLURM_HINT,SLURM_JOB_USER,CUDA_VISIBLE_DEVICES,CHPL_LLVM,SLURM_NPROCS,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,
OMPI_MCA_btl_openib_if_include,SLURM_GTIDS,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,XDG_RUNTIME_DIR,SYS,XDG_CONFIG_DIRS,PATH,SLURM_JOB_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,JOB_SHMTMPDIR,G_BROKEN_FILENAMES,OMPI_MCA_mtl,HISTSIZE,CPU,SLURM_LOCALID,DOXYGEN_HOME,CVS_RSH,GPU_DEVICE_ORDINAL,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_,'  /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 1 --verbose
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_K7GAhZ '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 14915
[0] [New LWP 14918]
[0] [New LWP 14919]
[0] [New LWP 14920]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x000014c7dd9bb76f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/10.3.0/lib64/libstdc++.so.6.0.28-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                           Frame 
[0] * 1    Thread 0x14c7df824200 (LWP 14915) "test-locales_re" 0x000014c7dd9bb76f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x14c7da1dd700 (LWP 14918) "test-locales_re" 0x000014c7dd9e51e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x14c7d33eb700 (LWP 14919) "test-locales_re" 0x000014c7dd9f1e1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x14c7c71a2700 (LWP 14920) "test-locales_re" 0x000014c7dd9e51e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x14c7c71a2700 (LWP 14920) "test-locales_re"):
[0] #0  0x000014c7dd9e51e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014c7dcd7c7e5 in ips_ptl_pollintr (rcvthreadc=0x14c7c71a1d80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x000014c7de92b6ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x000014c7dd9f1a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x14c7d33eb700 (LWP 14919) "test-locales_re"):
[0] #0  0x000014c7dd9f1e1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x000014c7db9a65e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014c7db9a9519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014c7d9d690fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x000014c7de92b6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014c7dd9f1a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x14c7da1dd700 (LWP 14918) "test-locales_re"):
[0] #0  0x000014c7dd9e51e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014c7db9b14ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014c7db9a9519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014c7db96d24e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x000014c7de92b6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014c7dd9f1a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x14c7df824200 (LWP 14915) "test-locales_re"):
[0] #0  0x000014c7dd9bb76f in wait4 () from /lib64/libc.so.6
[0] #1  0x000014c7dd932bc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 14915) detached]
[freyag02:14915] *** Process received signal ***
[freyag02:14915] Signal: Aborted (6)
[freyag02:14915] Signal code:  (-547201364)
[freyag02:14915] [ 0] /lib64/libc.so.6(+0x4ad70)[0x14c7dd924d70]
[freyag02:14915] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x14c7dd924cdb]
[freyag02:14915] [ 2] /lib64/libc.so.6(abort+0x177)[0x14c7dd926375]
[freyag02:14915] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freyag02:14915] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freyag02:14915] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freyag02:14915] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freyag02:14915] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freyag02:14915] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freyag02:14915] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freyag02:14915] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x14c7dd90f2bd]
[freyag02:14915] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freyag02:14915] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 14915 on node freyag02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Although I request only one node for testing, I see two jobs running, one of them spawned by Chapel:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            748422    p.test CHPL-tes    adutt  R       0:07      1 freyag02
            748421    p.test chapel-j    adutt  R       0:08      1 freyag01
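One way to check for the extra job, sketched under the assumption that squeue prints its default columns as in the listing above (USER in column 4, ST in column 5). The helper count_running_jobs is hypothetical and only counts rows; the sample text is copied from the listing.

```shell
# count_running_jobs USER
# Reads squeue-style text on stdin and prints the number of rows that
# belong to USER and are in state R (running). Hypothetical helper.
count_running_jobs() {
  awk -v user="$1" 'NR > 1 && $4 == user && $5 == "R" { n++ } END { print n + 0 }'
}

# Sample rows copied from the squeue listing in this post.
sample='             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            748422    p.test CHPL-tes    adutt  R       0:07      1 freyag02
            748421    p.test chapel-j    adutt  R       0:08      1 freyag01'

printf '%s\n' "$sample" | count_running_jobs adutt   # prints 2
```

On a live system the input would come from `squeue -u "$USER"`. A count above one for a single-node submission suggests that the launcher requested its own allocation.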

On running it interactively, I get the following:

$ ./test-locales -nl 1 --verbose
salloc --quiet -J CHPL-test-local -N 1 --ntasks=1 --exclusive --partition=p.test  /freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin//gasnetrun_ofi -n 1 -N 1 -c 0 -E 'SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,XKEYSYMDB,OMPI_MCA_btl_openib_allow_ib,GASNET_BACKTRACE,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HDF5_HOME,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,HDF5_ROOT,SLURM_DISTRIBUTION,GPG_TTY,LESS_ADVANCED_PREPROCESSOR,OPENMPI_HOME,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,CHPL_LAUNCHER,MACHTYPE,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,SLURM_NTASKS,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_THREADS_PER_CORE,_CE_CONDA,LIBGL_DEBUG,__MODULES_LMALTNAME,SLURM_JOB_NAME,GCC_HOME,PROFILEREAD,LIBRARY_PATH,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SSH_TTY,OMPI_MCA_pml,FROM_HEADER,MAIL,SLURM_JOB_NUM_NODES,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,SLURM_HINT,CHPL_LLVM,SLURM_NPROCS,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,OMPI_MCA_btl_openib_if_include,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,XDG_RUNTIME_DIR,SYS,XDG_CONFIG_DIRS,PATH,SLURM_JOB_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,G_BROKEN_FILENAMES,OMPI_MCA_mtl,HISTSIZE,CPU,DOXYGEN_HOME,CVS_RSH,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shel
l_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_,'  /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 1 --verbose
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_kubrY9 '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 15261
[0] [New LWP 15262]
[0] [New LWP 15263]
[0] [New LWP 15264]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x000014dd49d5776f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/10.3.0/lib64/libstdc++.so.6.0.28-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                           Frame 
[0] * 1    Thread 0x14dd4bbc0200 (LWP 15261) "test-locales_re" 0x000014dd49d5776f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x14dd465dd700 (LWP 15262) "test-locales_re" 0x000014dd49d811e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x14dd3f6eb700 (LWP 15263) "test-locales_re" 0x000014dd49d8de1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x14dd335aa700 (LWP 15264) "test-locales_re" 0x000014dd49d811e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x14dd335aa700 (LWP 15264) "test-locales_re"):
[0] #0  0x000014dd49d811e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014dd491187e5 in ips_ptl_pollintr (rcvthreadc=0x14dd335a9d80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x000014dd4acc76ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x000014dd49d8da8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x14dd3f6eb700 (LWP 15263) "test-locales_re"):
[0] #0  0x000014dd49d8de1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x000014dd47d425e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014dd47d45519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014dd3fd8c0fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x000014dd4acc76ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014dd49d8da8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x14dd465dd700 (LWP 15262) "test-locales_re"):
[0] #0  0x000014dd49d811e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014dd47d4d4ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014dd47d45519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014dd47d0924e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x000014dd4acc76ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014dd49d8da8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x14dd4bbc0200 (LWP 15261) "test-locales_re"):
[0] #0  0x000014dd49d5776f in wait4 () from /lib64/libc.so.6
[0] #1  0x000014dd49ccebc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 15261) detached]
[freyag02:15261] *** Process received signal ***
[freyag02:15261] Signal: Aborted (6)
[freyag02:15261] Signal code:  (1268522668)
[freyag02:15261] [ 0] /lib64/libc.so.6(+0x4ad70)[0x14dd49cc0d70]
[freyag02:15261] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x14dd49cc0cdb]
[freyag02:15261] [ 2] /lib64/libc.so.6(abort+0x177)[0x14dd49cc2375]
[freyag02:15261] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freyag02:15261] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freyag02:15261] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freyag02:15261] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freyag02:15261] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freyag02:15261] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freyag02:15261] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freyag02:15261] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x14dd49cab2bd]
[freyag02:15261] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freyag02:15261] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 15261 on node freyag02 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

The code is taken from an example:

for loc in Locales do on loc do
  writeln((here.name, here.maxTaskPar));

@bradcray Somehow my reply was categorized as spam and hidden. Can you please see if you can unhide this?

@dutta-alankar —

[Sorry about the spam flag — Discourse is a bit skittish with new users (and yet somehow still lets spam through). Fortunately, these are flagged very clearly, so I would've seen and addressed this without the ping.]

Let me see if I can get others on our team who are more Slurm-savvy to look at this and check my work. But if I'm understanding your setup correctly, you are obtaining a Slurm allocation and then, within that allocation, running a Chapel program that was built using the slurm-gasnetrun_ofi launcher. I believe this causes the Chapel program to request additional resources from Slurm (the salloc call that appears in your verbose output), and it may be that something about that doesn't work right, either due to your system configuration or how our code deals with nested Slurm allocations.

In contrast, I think we typically expect Chapel programs built with slurm-gasnetrun_ofi to be launched from a login/interactive node, issue the Slurm command from there, and start running. If that's an option, I'd be curious whether you get better results.

Alternatively, if you want to do the Slurm interactions yourself, I believe you could use CHPL_LAUNCHER=gasnetrun_ofi with your existing script, and GASNet's launcher should detect that it's running within a Slurm allocation and do the right thing.

Again, I feel outside of my area of expertise here, so let me see if I can get someone to double-check my work. And I'll be curious if what I'm describing sounds accurate/reasonable to you, or if I've misunderstood something.

Thanks,
-Brad
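The two options described above can be sketched as a small decision: inside an existing Slurm allocation use the plain GASNet launcher, otherwise let the Chapel launcher ask Slurm for nodes. The function choose_launcher is hypothetical, and treating a non-empty SLURM_JOB_ID as "inside an allocation" is an assumption. As far as I understand, CHPL_LAUNCHER is consulted when the program is compiled, so the program would need to be rebuilt after changing it.

```shell
# choose_launcher
# Print the CHPL_LAUNCHER value suggested in this thread for the current context.
# Hypothetical helper; assumes SLURM_JOB_ID is set only inside a Slurm allocation.
choose_launcher() {
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Already inside sbatch/salloc: avoid a second, nested salloc.
    echo "gasnetrun_ofi"
  else
    # On a login node: let the launcher request the nodes itself.
    echo "slurm-gasnetrun_ofi"
  fi
}

echo "Suggested CHPL_LAUNCHER: $(choose_launcher)"
```

This only reports a suggestion; it does not export the variable or rebuild anything.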

For those subscribed by email, I had a typo in my previous response, typing gasnetrun_ibv where I meant gasnetrun_ofi. I've edited the message on Discourse to reflect that.

-Brad

The problem is that you must use CHPL_LAUNCHER=gasnetrun_mpi if you are going to invoke the Chapel executable from within an sbatch script. sbatch (effectively) calls salloc to allocate nodes, then invokes the Chapel program on one of the allocated nodes, which, because the launcher is slurm-gasnetrun_ofi, calls salloc again and allocates more nodes. You can see that in the first line of the -v output, and also because you see two nodes running Chapel instead of one. The failure is in gasnetc_ofi_init, so the Chapel program did start running, but I suspect that the system is not configured correctly by the second salloc, which was invoked from a compute node and not the head node. In my experience this never works. Try setting CHPL_LAUNCHER=gasnetrun_mpi and see if that fixes the problem.

John
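The check on the first line of the verbose output can be sketched as follows. The function launch_uses_salloc is hypothetical, and the sample line is an abbreviated prefix of the command shown earlier in the thread, not the full command.

```shell
# launch_uses_salloc
# Succeeds (exit status 0) if the first line on stdin starts with "salloc ".
# Hypothetical helper for inspecting the first line of --verbose output.
launch_uses_salloc() {
  IFS= read -r first_line || true
  case "$first_line" in
    "salloc "*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Abbreviated prefix of the launch command from the verbose output above.
sample_cmd='salloc --quiet -J CHPL-test-local -N 1 --ntasks=1 --exclusive --partition=p.test'

if printf '%s\n' "$sample_cmd" | launch_uses_salloc; then
  echo "launcher requests its own allocation (nested if run under sbatch)"
else
  echo "launcher reuses the current allocation"
fi
```

Piping the real output of `./test-locales -nl 1 --verbose` into the function would apply the same check to an actual run.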

I tried this, but now Chapel cannot find the gasnetrun_mpi binary. It looks like something is not getting built; I recompiled from scratch, but the issue persists.

/freya/ptmp/mpa/adutt/chapel-multi_locale/ssh/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/gasnetrun_mpi -n 1 -N 1 -c 0 -E SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SLURM_NODEID,SLURM_TASK_PID,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,SLURM_PRIO_PROCESS,XKEYSYMDB,OMPI_MCA_btl_openib_allow_ib,GASNET_BACKTRACE,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HDF5_HOME,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,HDF5_ROOT,SLURM_DISTRIBUTION,ENVIRONMENT,PROG,GPG_TTY,LESS_ADVANCED_PREPROCESSOR,OPENMPI_HOME,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,ROCR_VISIBLE_DEVICES,SLURM_PROCID,SLURM_JOB_GID,CHPL_LAUNCHER,MACHTYPE,SLURMD_NODENAME,JOB_TMPDIR,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,SLURM_GPUS_ON_NODE,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,SLURM_NTASKS,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_TOPOLOGY_ADDR,SLURM_THREADS_PER_CORE,_CE_CONDA,LIBGL_DEBUG,SLURM_WORKING_CLUSTER,__MODULES_LMALTNAME,SLURM_JOB_NAME,GCC_HOME,PROFILEREAD,TMPDIR,LIBRARY_PATH,SLURM_JOB_GPUS,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SLURM_TOPOLOGY_ADDR_PATTERN,SSH_TTY,OMPI_MCA_pml,FROM_HEADER,MAIL,SLURM_CPUS_ON_NODE,SLURM_JOB_NUM_NODES,SLURM_MEM_PER_NODE,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,SLURM_JOB_UID,ARGS,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,SLURM_HINT,SLURM_JOB_USER,CUDA_VISIBLE_DEVICES,CHPL_LLVM,SLURM_NPROCS,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,OMPI_MCA_btl_openib_if_include,SLURM_GTIDS,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUS
TER,XDG_RUNTIME_DIR,SYS,XDG_CONFIG_DIRS,PATH,SLURM_JOB_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,JOB_SHMTMPDIR,G_BROKEN_FILENAMES,OMPI_MCA_mtl,HISTSIZE,CPU,SLURM_LOCALID,DOXYGEN_HOME,CVS_RSH,GPU_DEVICE_ORDINAL,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_, /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 1 --verbose
internal error: execvp() failed for command /freya/ptmp/mpa/adutt/chapel-multi_locale/ssh/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/gasnetrun_mpi: No such file or directory
adutt@freya01:chapel-multi_locale$ echo $CHPL_LAUNCHER 
gasnetrun_mpi
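A quick way to see which GASNet launcher scripts actually exist is to list the gasnetrun_* files in the bin directory named in the error. Two things are visible in the output itself: the directory is the substrate-ofi one, where earlier runs invoked gasnetrun_ofi, and the failing path sits under .../ssh/chapel-2.3.0 rather than the CHPL_HOME from the job script. The helper list_gasnet_launchers below is hypothetical.

```shell
# list_gasnet_launchers DIR
# Print the names of all gasnetrun_* entries found in DIR (nothing if none exist).
# Hypothetical helper.
list_gasnet_launchers() {
  for f in "$1"/gasnetrun_*; do
    if [ -e "$f" ]; then
      basename "$f"
    fi
  done
  return 0
}

# Only look at the real install tree when CHPL_HOME is set.
if [ -n "${CHPL_HOME:-}" ]; then
  list_gasnet_launchers "$CHPL_HOME/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin"
fi
```

If the listing shows only gasnetrun_ofi, that would be consistent with the execvp() failure for gasnetrun_mpi.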

With the gasnetrun_ofi launcher, the run no longer spawns multiple jobs, but it still runs into an error.

Running Chapel program with 1 locales...
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/gasnetrun_ofi -n 1 -N 1 -c 0 -E SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SLURM_NODEID,SLURM_TASK_PID,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,SLURM_PRIO_PROCESS,XKEYSYMDB,OMPI_MCA_btl_openib_allow_ib,GASNET_BACKTRACE,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HDF5_HOME,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,HDF5_ROOT,SLURM_DISTRIBUTION,ENVIRONMENT,PROG,GPG_TTY,LESS_ADVANCED_PREPROCESSOR,OPENMPI_HOME,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,ROCR_VISIBLE_DEVICES,SLURM_PROCID,SLURM_JOB_GID,CHPL_LAUNCHER,MACHTYPE,SLURMD_NODENAME,JOB_TMPDIR,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,SLURM_GPUS_ON_NODE,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,SLURM_NTASKS,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_TOPOLOGY_ADDR,SLURM_THREADS_PER_CORE,_CE_CONDA,LIBGL_DEBUG,SLURM_WORKING_CLUSTER,__MODULES_LMALTNAME,SLURM_JOB_NAME,GCC_HOME,PROFILEREAD,TMPDIR,LIBRARY_PATH,SLURM_JOB_GPUS,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SLURM_TOPOLOGY_ADDR_PATTERN,SSH_TTY,OMPI_MCA_pml,FROM_HEADER,MAIL,SLURM_CPUS_ON_NODE,SLURM_JOB_NUM_NODES,SLURM_MEM_PER_NODE,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,SLURM_JOB_UID,ARGS,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,SLURM_HINT,SLURM_JOB_USER,CUDA_VISIBLE_DEVICES,CHPL_LLVM,SLURM_NPROCS,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,OMPI_MCA_btl_openib_if_include,SLURM_GTIDS,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,
XDG_RUNTIME_DIR,SYS,XDG_CONFIG_DIRS,PATH,SLURM_JOB_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,JOB_SHMTMPDIR,G_BROKEN_FILENAMES,OMPI_MCA_mtl,HISTSIZE,CPU,SLURM_LOCALID,DOXYGEN_HOME,CVS_RSH,GPU_DEVICE_ORDINAL,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_, /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 1 --verbose
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_xzzncP '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 36813
[0] [New LWP 36814]
[0] [New LWP 36815]
[0] [New LWP 36816]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x00001492abf2e76f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/10.3.0/lib64/libstdc++.so.6.0.28-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                           Frame 
[0] * 1    Thread 0x1492add97200 (LWP 36813) "test-locales_re" 0x00001492abf2e76f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x1492a87dd700 (LWP 36814) "test-locales_re" 0x00001492abf581e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x1492a1853700 (LWP 36815) "test-locales_re" 0x00001492abf64e1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x149295769700 (LWP 36816) "test-locales_re" 0x00001492abf581e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x149295769700 (LWP 36816) "test-locales_re"):
[0] #0  0x00001492abf581e9 in poll () from /lib64/libc.so.6
[0] #1  0x00001492ab2ef7e5 in ips_ptl_pollintr (rcvthreadc=0x149295768d80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x00001492ace9e6ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x00001492abf64a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x1492a1853700 (LWP 36815) "test-locales_re"):
[0] #0  0x00001492abf64e1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x00001492a9f195e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x00001492a9f1c519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x00001492a3d8c0fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x00001492ace9e6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x00001492abf64a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x1492a87dd700 (LWP 36814) "test-locales_re"):
[0] #0  0x00001492abf581e9 in poll () from /lib64/libc.so.6
[0] #1  0x00001492a9f244ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x00001492a9f1c519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x00001492a9ee024e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x00001492ace9e6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x00001492abf64a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x1492add97200 (LWP 36813) "test-locales_re"):
[0] #0  0x00001492abf2e76f in wait4 () from /lib64/libc.so.6
[0] #1  0x00001492abea5bc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 36813) detached]
[freyag01:36813] *** Process received signal ***
[freyag01:36813] Signal: Aborted (6)
[freyag01:36813] Associated errno: Unknown error 32765 (32765)
[freyag01:36813] Signal code:  (223)
[freyag01:36813] [ 0] /lib64/libc.so.6(+0x4ad70)[0x1492abe97d70]
[freyag01:36813] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x1492abe97cdb]
[freyag01:36813] [ 2] /lib64/libc.so.6(abort+0x177)[0x1492abe99375]
[freyag01:36813] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freyag01:36813] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freyag01:36813] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freyag01:36813] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freyag01:36813] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freyag01:36813] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freyag01:36813] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freyag01:36813] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x1492abe822bd]
[freyag01:36813] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freyag01:36813] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 36813 on node freyag01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

The only configuration that partially works is:

export CHPL_LAUNCHER=gasnetrun_ofi
export GASNET_OFI_SPAWNER=ssh

which gives the following output:

Running Chapel program with 1 locales...
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/gasnetrun_ofi -n 1 -N 1 -c 0 -E SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SLURM_NODEID,SLURM_TASK_PID,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,OMPI_MCA_btl_openib_allow_ib,SLURM_PRIO_PROCESS,XKEYSYMDB,GASNET_BACKTRACE,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HDF5_HOME,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,HDF5_ROOT,SLURM_DISTRIBUTION,ENVIRONMENT,PROG,GPG_TTY,OPENMPI_HOME,LESS_ADVANCED_PREPROCESSOR,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,ROCR_VISIBLE_DEVICES,SLURM_PROCID,SLURM_JOB_GID,CHPL_LAUNCHER,MACHTYPE,SLURMD_NODENAME,JOB_TMPDIR,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,SLURM_GPUS_ON_NODE,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_TOPOLOGY_ADDR,_CE_CONDA,LIBGL_DEBUG,SLURM_WORKING_CLUSTER,__MODULES_LMALTNAME,GCC_HOME,SLURM_JOB_NAME,PROFILEREAD,TMPDIR,LIBRARY_PATH,SLURM_JOB_GPUS,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SLURM_TOPOLOGY_ADDR_PATTERN,SSH_TTY,OMPI_MCA_pml,FROM_HEADER,MAIL,SLURM_CPUS_ON_NODE,SLURM_JOB_NUM_NODES,SLURM_MEM_PER_NODE,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,SLURM_JOB_UID,ARGS,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,SLURM_HINT,SLURM_JOB_USER,CUDA_VISIBLE_DEVICES,CHPL_LLVM,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,OMPI_MCA_btl_openib_if_include,SLURM_GTIDS,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,XDG_RUNTIME_DIR,SYS,XDG_CONFIG_DIRS,PATH,SLURM_JO
B_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,JOB_SHMTMPDIR,G_BROKEN_FILENAMES,OMPI_MCA_mtl,HISTSIZE,CPU,DOXYGEN_HOME,SLURM_LOCALID,CVS_RSH,GPU_DEVICE_ORDINAL,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_, /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 1 --verbose
0: using core(s) 0-39
oversubscribed = False
QTHREADS: Using 40 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8384512 byte stack size.
comm task bound to accessible PUs
PSHM is disabled.
executing locale 0 of 1 on node 'freyag01'
(freyag01, 40)

However, it fails when I request more than one node:

Running Chapel program with 2 locales...
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/gasnetrun_ofi -n 2 -N 2 -c 0 -E SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SLURM_NODEID,SLURM_TASK_PID,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,OMPI_MCA_btl_openib_allow_ib,SLURM_PRIO_PROCESS,XKEYSYMDB,GASNET_BACKTRACE,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HDF5_HOME,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,HDF5_ROOT,SLURM_DISTRIBUTION,ENVIRONMENT,PROG,GPG_TTY,OPENMPI_HOME,LESS_ADVANCED_PREPROCESSOR,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,ROCR_VISIBLE_DEVICES,SLURM_PROCID,SLURM_JOB_GID,CHPL_LAUNCHER,MACHTYPE,SLURMD_NODENAME,JOB_TMPDIR,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,SLURM_GPUS_ON_NODE,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_TOPOLOGY_ADDR,_CE_CONDA,LIBGL_DEBUG,SLURM_WORKING_CLUSTER,__MODULES_LMALTNAME,GCC_HOME,SLURM_JOB_NAME,PROFILEREAD,TMPDIR,LIBRARY_PATH,SLURM_JOB_GPUS,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SLURM_TOPOLOGY_ADDR_PATTERN,SSH_TTY,OMPI_MCA_pml,FROM_HEADER,MAIL,SLURM_CPUS_ON_NODE,SLURM_JOB_NUM_NODES,SLURM_MEM_PER_NODE,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,SLURM_JOB_UID,ARGS,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,SLURM_HINT,SLURM_JOB_USER,CUDA_VISIBLE_DEVICES,CHPL_LLVM,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,OMPI_MCA_btl_openib_if_include,SLURM_GTIDS,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,XDG_RUNTIME_DIR,SYS,XDG_CONFIG_DIRS,PATH,SLURM_JO
B_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,JOB_SHMTMPDIR,G_BROKEN_FILENAMES,OMPI_MCA_mtl,HISTSIZE,CPU,DOXYGEN_HOME,SLURM_LOCALID,CVS_RSH,GPU_DEVICE_ORDINAL,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_, /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 2 --verbose
*** SSH-SPAWNER (freyag01:2567): Failed to start processes on freyag02, possibly due to an inability to establish an ssh connection from freyag01 without interactive authentication.
*** FATAL ERROR (freyag01:2567): in reap_one() at net/gasnet-src/other/ssh-spawner/gasnet_bootstrap_ssh.c:525: One or more processes died before setup was completed
*** WARNING (freyag01:2567): Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init

test-locales_real:2567 terminated with signal 6 at PC=154a412fbcdb SP=7ffcce7bae60.  Backtrace:
/lib64/libc.so.6(gsignal+0x10d)[0x154a412fbcdb]
/lib64/libc.so.6(abort+0x177)[0x154a412fd375]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5a6079]
/lib64/libc.so.6(+0x4ad70)[0x154a412fbd70]
/lib64/libpthread.so.0(accept+0x13)[0x154a4230d811]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5a0c0e]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x59f9bf]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x59f3ec]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5967c4]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x505180]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
/lib64/libc.so.6(__libc_start_main+0xef)[0x154a412e62bd]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]

Using

export GASNET_OFI_SPAWNER=mpi

fails even with one node:

Running Chapel program with 1 locales...
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/gasnetrun_ofi -n 1 -N 1 -c 0 -E SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SLURM_NODEID,SLURM_TASK_PID,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,OMPI_MCA_btl_openib_allow_ib,SLURM_PRIO_PROCESS,XKEYSYMDB,GASNET_BACKTRACE,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HDF5_HOME,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,HDF5_ROOT,SLURM_DISTRIBUTION,ENVIRONMENT,PROG,GPG_TTY,OPENMPI_HOME,LESS_ADVANCED_PREPROCESSOR,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,ROCR_VISIBLE_DEVICES,SLURM_PROCID,SLURM_JOB_GID,CHPL_LAUNCHER,MACHTYPE,SLURMD_NODENAME,JOB_TMPDIR,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,SLURM_GPUS_ON_NODE,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_TOPOLOGY_ADDR,_CE_CONDA,LIBGL_DEBUG,SLURM_WORKING_CLUSTER,__MODULES_LMALTNAME,GCC_HOME,SLURM_JOB_NAME,PROFILEREAD,TMPDIR,LIBRARY_PATH,SLURM_JOB_GPUS,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SLURM_TOPOLOGY_ADDR_PATTERN,SSH_TTY,OMPI_MCA_pml,FROM_HEADER,MAIL,SLURM_CPUS_ON_NODE,SLURM_JOB_NUM_NODES,SLURM_MEM_PER_NODE,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,SLURM_JOB_UID,ARGS,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,SLURM_HINT,SLURM_JOB_USER,CUDA_VISIBLE_DEVICES,CHPL_LLVM,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,OMPI_MCA_btl_openib_if_include,SLURM_GTIDS,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,XDG_RUNTIME_DIR,SYS,XDG_CONFIG_DIRS,PATH,SLURM_JO
B_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,JOB_SHMTMPDIR,G_BROKEN_FILENAMES,OMPI_MCA_mtl,HISTSIZE,CPU,DOXYGEN_HOME,SLURM_LOCALID,CVS_RSH,GPU_DEVICE_ORDINAL,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_, /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 1 --verbose
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_GNlyhn '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 2951
[0] [New LWP 2952]
[0] [New LWP 2953]
[0] [New LWP 2954]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x000014b70ddcd76f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/10.3.0/lib64/libstdc++.so.6.0.28-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                          Frame 
[0] * 1    Thread 0x14b70fc36200 (LWP 2951) "test-locales_re" 0x000014b70ddcd76f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x14b70a5dd700 (LWP 2952) "test-locales_re" 0x000014b70ddf71e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x14b7036eb700 (LWP 2953) "test-locales_re" 0x000014b70de03e1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x14b6f75ac700 (LWP 2954) "test-locales_re" 0x000014b70ddf71e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x14b6f75ac700 (LWP 2954) "test-locales_re"):
[0] #0  0x000014b70ddf71e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014b70d18e7e5 in ips_ptl_pollintr (rcvthreadc=0x14b6f75abd80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x000014b70ed3d6ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x000014b70de03a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x14b7036eb700 (LWP 2953) "test-locales_re"):
[0] #0  0x000014b70de03e1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x000014b70bdb85e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014b70bdbb519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014b703d8c0fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x000014b70ed3d6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014b70de03a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x14b70a5dd700 (LWP 2952) "test-locales_re"):
[0] #0  0x000014b70ddf71e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014b70bdc34ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014b70bdbb519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014b70bd7f24e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x000014b70ed3d6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014b70de03a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x14b70fc36200 (LWP 2951) "test-locales_re"):
[0] #0  0x000014b70ddcd76f in wait4 () from /lib64/libc.so.6
[0] #1  0x000014b70dd44bc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 2951) detached]
[freyag01:02951] *** Process received signal ***
[freyag01:02951] Signal: Aborted (6)
[freyag01:02951] Signal code:  (262373036)
[freyag01:02951] [ 0] /lib64/libc.so.6(+0x4ad70)[0x14b70dd36d70]
[freyag01:02951] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x14b70dd36cdb]
[freyag01:02951] [ 2] /lib64/libc.so.6(abort+0x177)[0x14b70dd38375]
[freyag01:02951] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freyag01:02951] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freyag01:02951] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freyag01:02951] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freyag01:02951] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freyag01:02951] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freyag01:02951] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freyag01:02951] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x14b70dd212bd]
[freyag01:02951] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freyag01:02951] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 2951 on node freyag01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Setting

export CHPL_LAUNCHER=gasnetrun_mpi

I get an error saying that the launcher binary does not exist:

internal error: execvp() failed for command /freya/ptmp/mpa/adutt/chapel-multi_locale/ssh/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/gasnetrun_mpi: No such file or directory

ls gives

adutt@freya01:chapel-2.3.0$ ls /freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/
gasnetrun_ofi  gasnetrun_ofi-mpi.pl  gasnetrun_ofi.pl  gasnet_trace  gasnet_trace.pl  ident

@dutta-alankar: Your last ssh-based result seems encouraging to me. When possible, we generally prefer using ssh to spawn rather than mpirun because it tends to use fewer resources, both while starting up the program and while it is running.

Here are a few questions:

  • Back in your original config, did you ever try running the Chapel program outside of the slurm partition requested by your job script? That is, with CHPL_LAUNCHER=slurm-gasnetrun_ofi, did you ever try running ./test-locales -nl 1 to have Chapel request the node from slurm and launch the program?
  • For your most recent runs, you're running the slurm commands to reserve the nodes yourself, is that correct?
  • If you salloc a group of 'n' nodes, from the login node (or wherever you're running your 2-locale ssh-based run), are you able to ssh into each node you're granted in a password-less manner? (e.g., it looks to me as though the login node may be able to ssh into freyag01 without a password but not freyag02?). If not, are you able to update your ssh setup to permit it?
  • I'm also curious whether you get better results if you set export GASNET_SSH_SERVERS="freyag01 freyag01 freyag01 ..." to launch all locales onto the single node that you seem to be able to ssh into.
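
For that last experiment, a minimal sketch of building the repeated host list rather than typing it by hand is shown here. The helper name repeat_host is hypothetical (not part of Chapel or GASNet), and freyag01 and the count of 4 are just placeholders taken from the runs above:

```shell
# repeat_host is a hypothetical helper: it prints HOST repeated COUNT times,
# separated by single spaces, which is the form GASNET_SSH_SERVERS is given above.
repeat_host() {
  host=$1
  count=$2
  list=""
  i=0
  while [ "$i" -lt "$count" ]; do
    # Append the host, adding a separating space only after the first entry.
    list="${list:+$list }$host"
    i=$((i + 1))
  done
  printf '%s\n' "$list"
}

# Assumption: freyag01 is the one node that accepts ssh; 4 is an arbitrary locale count.
export GASNET_SSH_SERVERS="$(repeat_host freyag01 4)"
echo "$GASNET_SSH_SERVERS"
```

With this set, running ./test-locales -nl 4 under CHPL_LAUNCHER=gasnetrun_ofi and GASNET_OFI_SPAWNER=ssh should attempt to place all four locales on that one node; it only tests the ssh spawning path and is not meant as a performance configuration.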

-Brad

As you suggested, I tried different combinations, but it looks like running Chapel directly from the login node (without a Slurm job script submitted via sbatch) always gives an error. Additionally, I cannot simply ssh into the compute nodes from the login nodes: it asks for a password, and even my login password doesn't work on the compute nodes. I have a public key in ~/.ssh that is added to authorized_keys, but that doesn't allow passwordless ssh from the login node to the worker nodes.
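
One way to confirm this node by node is a small non-interactive ssh check, sketched below. The function name check_nodes and the SSH_CMD override are hypothetical (the override exists only so the check can be dry-run); BatchMode and ConnectTimeout are standard OpenSSH options, and the node names are the ones from the runs above:

```shell
# check_nodes is a hypothetical helper: it tries a non-interactive ssh to each
# node and reports whether key-based login works.
# SSH_CMD is an assumed override for dry runs; it defaults to plain ssh.
check_nodes() {
  failed=0
  for node in "$@"; do
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # ConnectTimeout=5 keeps unreachable nodes from hanging the loop.
    if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
      echo "$node: ok"
    else
      echo "$node: needs interactive authentication or is unreachable"
      failed=1
    fi
  done
  return "$failed"
}

# Example from the login node:
#   check_nodes freyag01 freyag02
# Example from inside a Slurm allocation, using the granted node list:
#   check_nodes $(scontrol show hostnames "$SLURM_JOB_NODELIST")
```

Running it both from the login node and from the first node of an allocation would show whether the restriction applies only to login-to-compute connections or also to compute-to-compute ones, which is the path the ssh spawner used in the 2-locale run.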

adutt@freya01:chapel-multi_locale$ $CHPL_HOME/util/chplenv/printchplbuilds.py
                           <Current>              0                 
     CHPL_TARGET_PLATFORM: linux64              linux64             
     CHPL_TARGET_COMPILER: llvm                 llvm                
         CHPL_TARGET_ARCH: x86_64               x86_64              
          CHPL_TARGET_CPU: skylake              skylake             
        CHPL_LOCALE_MODEL: flat                 flat                
                CHPL_COMM: gasnet               gasnet              
          CHPL_COMM_DEBUG: -                    -                   
      CHPL_COMM_SUBSTRATE: ofi                  ofi                 
      CHPL_GASNET_SEGMENT: everything           everything          
               CHPL_TASKS: qthreads             qthreads            
         CHPL_TASKS_DEBUG: -                    -                   
              CHPL_TIMERS: generic              generic             
              CHPL_UNWIND: none                 none                
                 CHPL_MEM: jemalloc             jemalloc            
             CHPL_ATOMICS: cstdlib              cstdlib             
               CHPL_HWLOC: bundled              bundled             
         CHPL_HWLOC_DEBUG: -                    -                   
           CHPL_HWLOC_PCI: enable               enable              
                 CHPL_RE2: bundled              bundled             
         CHPL_AUX_FILESYS: none                 none                
             CHPL_LIB_PIC: none                 none                
        CHPL_SANITIZE_EXE: none                 none                
                    MTIME: NA                   Feb 01 09:58

Setting up Slurm for Chapel

adutt@freya01:chapel-multi_locale$ # job specs for chapel
adutt@freya01:chapel-multi_locale$ export CHPL_LAUNCHER_PARTITION=p.test
adutt@freya01:chapel-multi_locale$ export CHPL_LAUNCHER_NODE_ACCESS=exclusive

MPI with slurm

adutt@freya01:chapel-multi_locale$ export CHPL_LAUNCHER=slurm-gasnetrun_ofi
adutt@freya01:chapel-multi_locale$ export GASNET_OFI_SPAWNER=mpi
adutt@freya01:chapel-multi_locale$ ./test-locales -nl 1
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_o24QuV '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 33905
[0] [New LWP 33906]
[0] [New LWP 33907]
[0] [New LWP 33908]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x00001499cfb0176f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/10.3.0/lib64/libstdc++.so.6.0.28-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                           Frame 
[0] * 1    Thread 0x1499d1968200 (LWP 33905) "test-locales_re" 0x00001499cfb0176f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x1499cc3dd700 (LWP 33906) "test-locales_re" 0x00001499cfb2b1e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x1499c544d700 (LWP 33907) "test-locales_re" 0x00001499cfb37e1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x1499b92fb700 (LWP 33908) "test-locales_re" 0x00001499cfb2b1e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x1499b92fb700 (LWP 33908) "test-locales_re"):
[0] #0  0x00001499cfb2b1e9 in poll () from /lib64/libc.so.6
[0] #1  0x00001499ceec27e5 in ips_ptl_pollintr (rcvthreadc=0x1499b92fad80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x00001499d0a716ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x00001499cfb37a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x1499c544d700 (LWP 33907) "test-locales_re"):
[0] #0  0x00001499cfb37e1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x00001499cdaec5e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x00001499cdaef519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x00001499c7d8c0fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x00001499d0a716ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x00001499cfb37a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x1499cc3dd700 (LWP 33906) "test-locales_re"):
[0] #0  0x00001499cfb2b1e9 in poll () from /lib64/libc.so.6
[0] #1  0x00001499cdaf74ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x00001499cdaef519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x00001499cdab324e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x00001499d0a716ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x00001499cfb37a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x1499d1968200 (LWP 33905) "test-locales_re"):
[0] #0  0x00001499cfb0176f in wait4 () from /lib64/libc.so.6
[0] #1  0x00001499cfa78bc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 33905) detached]
[freya01:33905] *** Process received signal ***
[freya01:33905] Signal: Aborted (6)
[freya01:33905] Signal code:  (-780747092)
[freya01:33905] [ 0] /lib64/libc.so.6(+0x4ad70)[0x1499cfa6ad70]
[freya01:33905] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x1499cfa6acdb]
[freya01:33905] [ 2] /lib64/libc.so.6(abort+0x177)[0x1499cfa6c375]
[freya01:33905] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freya01:33905] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freya01:33905] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freya01:33905] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freya01:33905] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freya01:33905] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freya01:33905] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freya01:33905] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x1499cfa552bd]
[freya01:33905] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freya01:33905] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node freya01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

ssh with slurm

adutt@freya01:chapel-multi_locale$ export CHPL_LAUNCHER=slurm-gasnetrun_ofi
adutt@freya01:chapel-multi_locale$ export GASNET_OFI_SPAWNER=ssh
adutt@freya01:chapel-multi_locale$ ./test-locales -nl 1
*** SSH-SPAWNER (freya01:512): Failed to start processes on freyag01, possibly due to an inability to establish an ssh connection from freya01 without interactive authentication.
*** FATAL ERROR (freya01:512): in reap_one() at net/gasnet-src/other/ssh-spawner/gasnet_bootstrap_ssh.c:525: One or more processes died before setup was completed
*** WARNING (freya01:512): Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init

test-locales_real:512 terminated with signal 6 at PC=1477e439fcdb SP=7ffedb7ec4a0.  Backtrace:
/lib64/libc.so.6(gsignal+0x10d)[0x1477e439fcdb]
/lib64/libc.so.6(abort+0x177)[0x1477e43a1375]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5a6079]
/lib64/libc.so.6(+0x4ad70)[0x1477e439fd70]
/lib64/libpthread.so.0(accept+0x13)[0x1477e53b1811]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5a0c0e]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x59f9bf]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x59f3ec]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5967c4]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x505180]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
/lib64/libc.so.6(__libc_start_main+0xef)[0x1477e438a2bd]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]

ssh without slurm

adutt@freya01:chapel-multi_locale$ export CHPL_LAUNCHER=gasnetrun_ofi
adutt@freya01:chapel-multi_locale$ export GASNET_OFI_SPAWNER=ssh
adutt@freya01:chapel-multi_locale$ ./test-locales -nl 1
*** ERROR (freya01:34021): No GASNET_SSH_NODEFILE, GASNET_SSH_SERVERS, or GASNET_NODEFILE in environment
adutt@freya01:chapel-multi_locale$ export GASNET_SSH_SERVERS="freyag01 freyag02 freyag03 freyag04"
adutt@freya01:chapel-multi_locale$ ./test-locales -nl 1
*** SSH-SPAWNER (freya01:35098): Failed to start processes on freyag01, possibly due to an inability to establish an ssh connection from freya01 without interactive authentication.
*** FATAL ERROR (freya01:35098): in reap_one() at net/gasnet-src/other/ssh-spawner/gasnet_bootstrap_ssh.c:525: One or more processes died before setup was completed
*** WARNING (freya01:35098): Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init

test-locales_real:35098 terminated with signal 6 at PC=146d53eb8cdb SP=7ffca64a54e0.  Backtrace:
/lib64/libc.so.6(gsignal+0x10d)[0x146d53eb8cdb]
/lib64/libc.so.6(abort+0x177)[0x146d53eba375]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5a6079]
/lib64/libc.so.6(+0x4ad70)[0x146d53eb8d70]
/lib64/libpthread.so.0(accept+0x13)[0x146d54eca811]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5a0c0e]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x59f9bf]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x59f3ec]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5967c4]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x505180]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
/lib64/libc.so.6(__libc_start_main+0xef)[0x146d53ea32bd]
/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
adutt@freya01:chapel-multi_locale$ # mpi without slurm
adutt@freya01:chapel-multi_locale$ export CHPL_LAUNCHER=gasnetrun_ofi
adutt@freya01:chapel-multi_locale$ export GASNET_OFI_SPAWNER=mpi
adutt@freya01:chapel-multi_locale$ 
adutt@freya01:chapel-multi_locale$ ./test-locales -nl 1
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_OjB3v2 '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 35784
[0] [New LWP 35803]
[0] [New LWP 35804]
[0] [New LWP 35805]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x000014e86390f76f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/10.3.0/lib64/libstdc++.so.6.0.28-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                           Frame 
[0] * 1    Thread 0x14e865776200 (LWP 35784) "test-locales_re" 0x000014e86390f76f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x14e8601dd700 (LWP 35803) "test-locales_re" 0x000014e8639391e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x14e85d42a700 (LWP 35804) "test-locales_re" 0x000014e863945e1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x14e84d178700 (LWP 35805) "test-locales_re" 0x000014e8639391e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x14e84d178700 (LWP 35805) "test-locales_re"):
[0] #0  0x000014e8639391e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014e862cd07e5 in ips_ptl_pollintr (rcvthreadc=0x14e84d177d80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x000014e86487f6ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x000014e863945a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x14e85d42a700 (LWP 35804) "test-locales_re"):
[0] #0  0x000014e863945e1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x000014e8618fa5e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014e8618fd519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014e85fd690fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x000014e86487f6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014e863945a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x14e8601dd700 (LWP 35803) "test-locales_re"):
[0] #0  0x000014e8639391e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014e8619054ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014e8618fd519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014e8618c124e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x000014e86487f6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014e863945a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x14e865776200 (LWP 35784) "test-locales_re"):
[0] #0  0x000014e86390f76f in wait4 () from /lib64/libc.so.6
[0] #1  0x000014e863886bc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 35784) detached]
[freya01:35784] *** Process received signal ***
[freya01:35784] Signal: Aborted (6)
[freya01:35784] Signal code:  (1700241068)
[freya01:35784] [ 0] /lib64/libc.so.6(+0x4ad70)[0x14e863878d70]
[freya01:35784] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x14e863878cdb]
[freya01:35784] [ 2] /lib64/libc.so.6(abort+0x177)[0x14e86387a375]
[freya01:35784] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freya01:35784] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freya01:35784] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freya01:35784] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freya01:35784] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freya01:35784] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freya01:35784] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freya01:35784] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x14e8638632bd]
[freya01:35784] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freya01:35784] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node freya01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

MPI without slurm

adutt@freya01:chapel-multi_locale$ export CHPL_LAUNCHER=gasnetrun_ofi
adutt@freya01:chapel-multi_locale$ export GASNET_OFI_SPAWNER=mpi
adutt@freya01:chapel-multi_locale$ ./test-locales -nl 1
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_CbSrYJ '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 3022
[0] [New LWP 3023]
[0] [New LWP 3024]
[0] [New LWP 3025]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x000014bb4c3b076f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/10.3.0/lib64/libstdc++.so.6.0.28-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                          Frame 
[0] * 1    Thread 0x14bb4e217200 (LWP 3022) "test-locales_re" 0x000014bb4c3b076f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x14bb48bdd700 (LWP 3023) "test-locales_re" 0x000014bb4c3da1e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x14bb41c75700 (LWP 3024) "test-locales_re" 0x000014bb4c3e6e1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x14bb35bd7700 (LWP 3025) "test-locales_re" 0x000014bb4c3da1e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x14bb35bd7700 (LWP 3025) "test-locales_re"):
[0] #0  0x000014bb4c3da1e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014bb4b7717e5 in ips_ptl_pollintr (rcvthreadc=0x14bb35bd6d80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x000014bb4d3206ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x000014bb4c3e6a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x14bb41c75700 (LWP 3024) "test-locales_re"):
[0] #0  0x000014bb4c3e6e1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x000014bb4a39b5e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014bb4a39e519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014bb43d8c0fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x000014bb4d3206ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014bb4c3e6a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x14bb48bdd700 (LWP 3023) "test-locales_re"):
[0] #0  0x000014bb4c3da1e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014bb4a3a64ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014bb4a39e519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x000014bb4a36224e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x000014bb4d3206ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x000014bb4c3e6a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x14bb4e217200 (LWP 3022) "test-locales_re"):
[0] #0  0x000014bb4c3b076f in wait4 () from /lib64/libc.so.6
[0] #1  0x000014bb4c327bc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 3022) detached]
[freya01:03022] *** Process received signal ***
[freya01:03022] Signal: Aborted (6)
[freya01:03022] Associated errno: Unknown error 32766 (32766)
[freya01:03022] Signal code:  (223)
[freya01:03022] [ 0] /lib64/libc.so.6(+0x4ad70)[0x14bb4c319d70]
[freya01:03022] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x14bb4c319cdb]
[freya01:03022] [ 2] /lib64/libc.so.6(abort+0x177)[0x14bb4c31b375]
[freya01:03022] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freya01:03022] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freya01:03022] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freya01:03022] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freya01:03022] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freya01:03022] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freya01:03022] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freya01:03022] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x14bb4c3042bd]
[freya01:03022] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freya01:03022] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node freya01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Finally, the following slurm script works when submitted, but only if the job requests a single node. It fails for more than one node.

#!/bin/bash
#SBATCH -t 0:10:0
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --partition=p.test
#SBATCH --output=output.chapel

module load gcc/10 openmpi/4 hdf5-mpi/1.12.0 cmake/3.28 doxygen/1.10.0

export CHPL_HOME=/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ofi
export FI_PROVIDER=psm2
export CHPL_LAUNCHER=gasnetrun_ofi
export GASNET_OFI_SPAWNER=ssh
export HFI_NO_CPUAFFINITY=1
export CHPL_LLVM=bundled
export CHPL_TARGET_CPU=skylake
export GASNET_BACKTRACE=1

$CHPL_HOME/bin/linux64-x86_64/chpl test-locales.chpl --fast -o test-locales

# Set the Chapel program and dynamic number of locales
export PROG="./test-locales"
export ARGS="-nl $SLURM_NNODES --verbose"  # Dynamically set the number of locales

# Run the Chapel program; the launcher selected by CHPL_LAUNCHER (not srun) starts the locales
echo "Running Chapel program with $SLURM_NNODES locales..."
echo $CHPL_HOME
echo $CHPL_LAUNCHER
echo $GASNET_OFI_SPAWNER
$PROG $ARGS

Here is the output

adutt@freya01:chapel-multi_locale$ sbatch chapel-job 
Submitted batch job 749023
adutt@freya01:chapel-multi_locale$ tail -F ./output.chapel 
Running Chapel program with 1 locales...
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0
gasnetrun_ofi
ssh
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/bin/gasnetrun_ofi -n 1 -N 1 -c 0 -E SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SLURM_NODEID,SLURM_TASK_PID,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,SLURM_PRIO_PROCESS,XKEYSYMDB,OMPI_MCA_btl_openib_allow_ib,GASNET_BACKTRACE,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HDF5_HOME,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,HDF5_ROOT,SLURM_DISTRIBUTION,ENVIRONMENT,PROG,GPG_TTY,LESS_ADVANCED_PREPROCESSOR,OPENMPI_HOME,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,ROCR_VISIBLE_DEVICES,SLURM_PROCID,CHPL_LAUNCHER_PARTITION,SLURM_JOB_GID,CHPL_LAUNCHER,MACHTYPE,SLURMD_NODENAME,JOB_TMPDIR,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,SLURM_GPUS_ON_NODE,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,SLURM_NTASKS,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_TOPOLOGY_ADDR,SLURM_THREADS_PER_CORE,_CE_CONDA,LIBGL_DEBUG,SLURM_WORKING_CLUSTER,__MODULES_LMALTNAME,GASNET_SSH_SERVERS,SLURM_JOB_NAME,GCC_HOME,PROFILEREAD,TMPDIR,LIBRARY_PATH,SLURM_JOB_GPUS,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SLURM_TOPOLOGY_ADDR_PATTERN,SSH_TTY,OMPI_MCA_pml,FROM_HEADER,MAIL,SLURM_CPUS_ON_NODE,SLURM_JOB_NUM_NODES,SLURM_MEM_PER_NODE,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,SLURM_JOB_UID,ARGS,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,CHPL_LAUNCHER_NODE_ACCESS,SLURM_HINT,SLURM_JOB_USER,CUDA_VISIBLE_DEVICES,CHPL_LLVM,SLURM_NPROCS,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,OMPI_MCA_btl_op
enib_if_include,SLURM_GTIDS,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,XDG_RUNTIME_DIR,SYS,XDG_CONFIG_DIRS,PATH,SLURM_JOB_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,JOB_SHMTMPDIR,G_BROKEN_FILENAMES,OMPI_MCA_mtl,HISTSIZE,CPU,SLURM_LOCALID,DOXYGEN_HOME,CVS_RSH,GPU_DEVICE_ORDINAL,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_, /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 1 --verbose
0: using core(s) 0-39
oversubscribed = False
QTHREADS: Using 40 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8384512 byte stack size.
comm task bound to accessible PUs
PSHM is disabled.
executing locale 0 of 1 on node 'freyag01'
(freyag01, 40)

I also wanted to add the contents of slurm.conf for the cluster if that helps.

$ cat /etc/slurm/slurm.conf
#
# slurm.conf file on DRACO cluster
#
ClusterName=freya
ControlMachine=freyaio2
ControlAddr=freyaio2
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
CryptoType=crypto/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/save
SlurmdSpoolDir=/var/spool/slurmd/
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmdPidFile=/var/run/slurm/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
#CacheGroups=0
#FirstJobId=
ReturnToService=1
JobRequeue=1
MaxJobCount=30000
GresTypes=gpu
JobSubmitPlugins=lua
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=ALL
PropagateResourceLimitsExcept=MEMLOCK
Prolog=/etc/slurm/scripts/prolog
PrologFlags=contain
Epilog=/etc/slurm/scripts/epilog
#SrunProlog=
#SrunEpilog=
TaskProlog=/etc/slurm/scripts/prolog.task
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskPluginParam=Sched
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
UsePAM=1
#
# TIMERS
##SlurmctldTimeout=300
SlurmdTimeout=300
#SlurmctldTimeout=1200
#SlurmdTimeout=1200
InactiveLimit=0
MinJobAge=300
KillWait=60
CompleteWait=62
KillOnBadExit=1
Waittime=0
UnkillableStepTimeout=120
#
# SCHEDULING
SchedulerType=sched/backfill
SchedulerParameters=bf_window=2880,enable_user_top
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
#SelectTypeParameters=CR_CPU_Memory
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK
#FastSchedule=1
PriorityType=priority/multifactor
PriorityFavorSmall=YES
#PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING
PriorityFlags=FAIR_TREE,CALCULATE_RUNNING
PriorityDecayHalfLife=7-0
PriorityMaxAge=7-0
#PriorityUsageResetPeriod=14-0
#PriorityUsageResetPeriod=NOW
PriorityWeightAge=1000
PriorityWeightFairshare=10000000
PriorityWeightJobSize=1000
PriorityWeightPartition=10000
#PriorityWeightQOS=100000
PriorityWeightTRES=GRES/gpu=1000000 #,GRES/gpu:p100=10000,GRES/gpu:v100=10000
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/filetxt
JobCompLoc=/var/spool/slurm/acct/history
#DebugFlags=NO_CONF_HASH,Gres,CPU_Bind
DebugFlags=NO_CONF_HASH
##DebugFlags=Priority
#
HealthCheckInterval=180
HealthCheckProgram=/etc/slurm/load-sensor
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
##JobAcctGatherParams=NoShared
JobAcctGatherParams=UsePss
JobContainerType=job_container/none,job_container/tmpfs
#
#AccountingStorageType=accounting_storage/filetxt
#AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=associations,qos,limits
AccountingStorageTRES=gres/gpu,gres/gpu:p100,gres/gpu:v100,gres/gpu:a100
AccountingStorageHost=freyaio2
AccountingStoragePort=6829
##AccountingStorageLoc=/var/spool/slurm/acct/acct.dat
#AccountingStoragePass=
#AccountingStorageUser=

# TOPOLOGY
TopologyPlugin=topology/tree

SlurmctldParameters=node_reg_mem_percent=60
CommunicationParameters=block_null_hash
#---------------------------------------------------------------------------------------------------------
# COMPUTE NODES
#---------------------------------------------------------------------------------------------------------
NodeName=DEFAULT         Sockets=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=254000 CpuBind=core State=UNKNOWN
NodeName=freya[073-104,109-176] RealMemory=190000 Weight=1    Feature=mem192G
NodeName=freya[105-108]  RealMemory=380000 Weight=10    Feature=mem384G
NodeName=freyag[01-04]   RealMemory=380000 Weight=10000 Feature=mem384G,gpu Gres=gpu:p100:2
NodeName=freyag[05-08]   RealMemory=380000 Weight=100   Feature=mem384G,gpu Gres=gpu:p100:2
NodeName=freyag[09-12]   RealMemory=380000 Weight=1000  Feature=mem384G,gpu Gres=gpu:v100:2
NodeName=freya01         RealMemory=190000
NodeName=freya[03-04]    RealMemory=190000
NodeName=freya02         RealMemory=380000
NodeName=freyag[201-211] CoresPerSocket=24 Weight=1000 RealMemory=380000 Feature=mem384G,gpu Gres=gpu:a100:4
#NodeName=freyator	 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=380000
#NodeName=virgo           Sockets=4 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=1030000



#---------------------------------------------------------------------------------------------------------
# PARTITION SETTINGS
#---------------------------------------------------------------------------------------------------------
EnforcePartLimits=YES
PartitionName=DEFAULT      State=UP MinNodes=1 MaxNodes=8 PreemptMode=REQUEUE Priority=1000 DisableRootJobs=YES
PartitionName=p.24h        Nodes=freya[073-176]    AllocNodes=freya[01-04],freya[001-176],freyag[01-12,201-210] MaxNodes=36 MaxTime=24:00:00 DefMemPerNode=188000 Priority=10  OverSubscribe=Exclusive QOS=normal DEFAULT=YES
PartitionName=p.gpu        Nodes=freyag[05-12]     AllocNodes=freya[01-04],freya[001-176],freyag[01-12]         MaxNodes=12 MaxTime=24:00:00 MaxMemPerNode=380000 Priority=10  OverSubscribe=Exclusive QOS=normal
PartitionName=p.test       Nodes=freyag[01-04]     AllocNodes=freya[01-04],freyag[01-04]                        MaxNodes=4 MaxTime=00:30:00 DefMemPerNode=9500 MaxMemPerNode=380000  Priority=10000 OverSubscribe=NO QOS=normal  AllowQos=normal
PartitionName=p.gpu.ampere Nodes=freyag[201-211]   AllocNodes=freya[01-04],freyag[201-211]                      MaxNodes=10 MaxTime=24:00:00 DefMemPerNode=95000 MaxMemPerNode=380000 Priority=10  OverSubscribe=NO QOS=gpu.ampere AllowQos=gpu.ampere
#PartitionName=p.384G      Nodes=freya[105-108],freyag[01-08]  MaxNodes=12  MaxTime=24:00:00 MaxMemPerNode=370000 Priority=1400 OverSubscribe=no AllowGroups=rzs Hidden=true
#PartitionName=p.24h       Nodes=freya[001-108],freyag[01-08] MaxNodes=36 MaxTime=24:00:00 MaxMemPerNode=188000 Priority=1200 OverSubscribe=Exclusive QOS=normal DEFAULT=YES
#PartitionName=test        Nodes=freya[001-108],freyag[01-08] MaxNodes=36 MaxTime=24:00:00 MaxMemPerNode=188000 Priority=1200 OverSubscribe=Exclusive QOS=nodequota
#PartitionName=p.gpu       Nodes=freyag[01-08]  MaxNodes=24 MaxTime=24:00:00 MaxMemPerNode=254000 Priority=1200 OverSubscribe=Exclusive
#PartitionName=s.48h       Nodes=freya[001-108] MaxTime=48:00:00 Priority=1100 OverSubscribe=FORCE DefMemPerCPU=6340
#PartitionName=test       Nodes=freya[001-108] MaxTime=30:00 Priority=1100 OverSubscribe=FORCE DefMemPerCPU=6340 AllowAccounts=rzs Hidden=true
PartitionName=rzgmon      Nodes=freya[01-04]   State=INACTIVE Hidden=True

Thanks for all the data, @dutta-alankar. I feel most motivated to focus on the ssh-based cases, primarily to avoid the MPI dependency combined with the aforementioned potential to impact performance.

ssh

In some past cases, we've had users who were able to address issues with ssh-ing to nodes that they'd allocated through slurm by working with their sysadmins, either by adjusting something in their environment or something in the system's config. Is that something you could approach Freya's sysadmins about? Essentially, "To get up and running, the software I'm using would like to do a [password-less] ssh from the login node into the nodes I've allocated with slurm. However, I'm finding that I can't ssh to those nodes with or without a password. Is there something I could do, or that could be changed in the configuration of Freya or slurm to permit that?" If there is, then I'm cautiously optimistic that the ssh with slurm case above would work and be your most attractive option.

mpi

If there's not, then we'll need to wrestle with one of the MPI configurations further, where it looks as though they're failing similarly. So again, I'd probably focus on the mpi with slurm case because I think it'll be more attractive if we can get it working. I've also found an internal issue where a user hit a similar error condition and worked around it by using ssh rather than mpi.

I think the next step if we were to wrestle more with mpi would be to take the salloc command that Chapel prints when you run it in --verbose mode and to run it manually, adding a -v option to the gasnetrun_ofi invocation, which will cause it to print more verbose output as well. I'm not confident that this will indicate the source of the problem, but it's worth gathering in any case.
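
As a rough sketch of that step, the -v flag can be spliced in right after the gasnetrun_ofi executable. The command string below is a shortened, hypothetical stand-in (placeholder path, no -E list) for the line that --verbose prints; on the cluster you would paste the real line instead.

```shell
# Hypothetical, shortened stand-in for the command printed by --verbose.
# The path and the omitted -E list are placeholders, not the real values.
cmd='salloc --quiet -N 1 --partition=p.test /path/to/gasnetrun_ofi -n 1 -N 1 -c 0 ./test-locales_real -nl 1'

# Insert -v immediately after the first "gasnetrun_ofi " in the command line.
verbose_cmd=$(printf '%s\n' "$cmd" | sed 's#\(gasnetrun_ofi\) #\1 -v #')

# Print the edited command so it can be inspected before running it by hand.
echo "$verbose_cmd"
```

Once the printed line looks right, it can be copied into the shell and run manually to collect the extra output.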

If it doesn't, we should then look into building a debug version of GASNet, as the error message suggests. With a very quick check, I believe we'd want to add --enable-debug to the configuration step, which I think you should be able to do by adding:

CHPL_GASNET_CFG_OPTIONS += --enable-debug

to $CHPL_HOME/third-party/gasnet/Makefile outside of any conditionals. You'd then need to re-build GASNet, which should be do-able by going to $CHPL_HOME/third-party/gasnet, running make clobber and then popping back up to $CHPL_HOME and doing a fresh make (you might inspect the GASNet configure line during that build to make sure the new option got added as expected before it takes all the time to build it).
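
The rebuild steps above might look roughly like the following. This is only a dry-run sketch: the run helper and the DRY_RUN variable are made up for illustration, the default CHPL_HOME is a placeholder, and it assumes the CHPL_GASNET_CFG_OPTIONS line has already been added to the Makefile by hand.

```shell
# Dry-run sketch: each step is printed unless DRY_RUN=0 is set.
CHPL_HOME="${CHPL_HOME:-/path/to/chapel}"   # placeholder, use the real install path
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Check that the --enable-debug line is present in the GASNet Makefile.
run grep -n -e '--enable-debug' "$CHPL_HOME/third-party/gasnet/Makefile"

# Discard the existing GASNet build, then rebuild from the top level.
run cd "$CHPL_HOME/third-party/gasnet"
run make clobber
run cd "$CHPL_HOME"
run make
```

Setting DRY_RUN=0 executes the same steps for real; the final make is the slow part, so it is worth checking the printed commands first.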

Meanwhile, I'll check with the GASNet team to see whether that error message points to any likely/familiar problems.

If it becomes more efficient at some point, we could look at doing a screen share to try and work through this, but at present, I'm not sure it'd be helpful since the ssh issue probably needs eyes from sysadmins, and I think we have enough things to do on the mpi side if we spend more effort there.

-Brad

Aha — The GASNet team says that a classic issue with using mpi as the spawner on OmniPath is that the network may enforce a single open per process, where MPI (as the spawner) is getting there first and effectively blocking GASNet. The two ways to work around this (noted within the mpi-spawner’s README, I'm learning) are to instruct OpenMPI not to use ofi/OPA as a transport, by doing one of:

  • Setting the environment variable OMPI_MCA_btl=tcp,self.
  • Passing "--mca btl tcp,self" to mpirun

(where the first of these two would be simplest).
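
For reference, a minimal sketch of the environment-variable route, reusing the launcher and spawner settings from earlier in this thread. Whether this alone is enough depends on how the site's Open MPI is configured, so treat it as something to try rather than a guaranteed fix.

```shell
# Restrict Open MPI (used here only as the spawner) to the TCP and self BTLs.
export OMPI_MCA_btl=tcp,self

# Launcher and spawner settings as used earlier in this thread.
export CHPL_LAUNCHER=gasnetrun_ofi
export GASNET_OFI_SPAWNER=mpi

# Show the setting that the spawned processes will inherit.
echo "OMPI_MCA_btl=$OMPI_MCA_btl"

# On the cluster, the program would then be launched as before, e.g.:
# ./test-locales -nl 1
```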

Note that it's this sort of competition for resources that makes me generally prefer the ssh spawner, so while I'm curious whether this makes the mpi spawning work, I'd still like to push more on whether you can get ssh working on this system, if possible.

-Brad

I will follow up with the sysadmins. Meanwhile, trying to compile GASNet with the debug option gives this error:

In file included from comm-gasnet-ex.c:23:
In file included from /freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-unknown-llvm-none/substrate-ofi/seg-everything/include/gasnet.h:11:
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-unknown-llvm-none/substrate-ofi/seg-everything/include/gasnetex.h:1135:6: error: Tried to compile GASNet client code with optimization enabled but also GASNET_DEBUG (which seriously hurts performance). Reconfigure/rebuild GASNet without --enable-debug
 1135 |     #error Tried to compile GASNet client code with optimization enabled but also GASNET_DEBUG (which seriously hurts performance). Reconfigure/rebuild GASNet without --enable-debug
      |      ^
1 error generated.

Contents of chplconfig file

# CHPL_TARGET_CPU=skylake
CHPL_COMM=gasnet
CHPL_COMM_SUBSTRATE=ofi
# CHPL_LAUNCHER=gasnetrun_mpi
CHPL_LLVM=bundled

Further, using export OMPI_MCA_btl=tcp,self doesn't fix the problem with mpi.

After following up with the sysadmins, it looks like the ssh method is not available. Here's what they said.

Hi Alankar —

Bummer about the response on ssh. It's not obvious to me whether they're saying it isn't technically possible to enable ssh (which would be surprising to me given our experience on other systems) or whether they're saying that there's a lab policy that would prevent enabling it.

Anyway, rather than focusing immediately on building GASNet with debugging (where I obviously missed a configuration option), let's have you try the

  • Setting the environment variable OMPI_MCA_btl=tcp,self

suggestion above and see if that gets your mpi configuration working without needing the debug option.

-Brad

Hi Brad,

Most likely it is Freya's configuration policy, which the sysadmins are not allowed to change, that makes ssh unavailable.
I encounter the following error on trying to run salloc -N 1 -p p.test
after compiling as

adutt@freya01:chapel-multi_locale$ ./test-locales -nl 1
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_z0BTqX '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 10220
[0] [New LWP 10221]
[0] [New LWP 10222]
[0] [New LWP 10223]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x0000155045b7576f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/11.2.0/lib64/libstdc++.so.6.0.29-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                           Frame 
[0] * 1    Thread 0x155047a1d200 (LWP 10220) "test-locales_re" 0x0000155045b7576f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x1550423dd700 (LWP 10221) "test-locales_re" 0x0000155045b9f1e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x15503b5ee700 (LWP 10222) "test-locales_re" 0x0000155045babe1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x155031304700 (LWP 10223) "test-locales_re" 0x0000155045b9f1e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x155031304700 (LWP 10223) "test-locales_re"):
[0] #0  0x0000155045b9f1e9 in poll () from /lib64/libc.so.6
[0] #1  0x0000155044f367e5 in ips_ptl_pollintr (rcvthreadc=0x155031303d80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x0000155046b246ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x0000155045baba8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x15503b5ee700 (LWP 10222) "test-locales_re"):
[0] #0  0x0000155045babe1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x0000155043b605e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x0000155043b63519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x0000155041f690fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x0000155046b246ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x0000155045baba8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x1550423dd700 (LWP 10221) "test-locales_re"):
[0] #0  0x0000155045b9f1e9 in poll () from /lib64/libc.so.6
[0] #1  0x0000155043b6b4ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x0000155043b63519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x0000155043b2724e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x0000155046b246ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x0000155045baba8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x155047a1d200 (LWP 10220) "test-locales_re"):
[0] #0  0x0000155045b7576f in wait4 () from /lib64/libc.so.6
[0] #1  0x0000155045aecbc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 10220) detached]
[freyag01:10220] *** Process received signal ***
[freyag01:10220] Signal: Aborted (6)
[freyag01:10220] Associated errno: Unknown error 32765 (32765)
[freyag01:10220] Signal code:  (223)
[freyag01:10220] [ 0] /lib64/libc.so.6(+0x4ad70)[0x155045aded70]
[freyag01:10220] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x155045adecdb]
[freyag01:10220] [ 2] /lib64/libc.so.6(abort+0x177)[0x155045ae0375]
[freyag01:10220] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freyag01:10220] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freyag01:10220] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freyag01:10220] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freyag01:10220] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freyag01:10220] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freyag01:10220] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freyag01:10220] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x155045ac92bd]
[freyag01:10220] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freyag01:10220] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10220 on node freyag01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

The problem persists when running with srun:

adutt@freya01:chapel-multi_locale$ srun ./test-locales -nl 1
srun: spank: option "enable-coredump" provided by both coredumpsize.so and coredumpsize.so
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1336: fi_endpoint for rdma failed: -22(Invalid argument)
*** NOTICE (proc 0): We recommend linking the debug version of GASNet to assist you in resolving this application issue.
*** Details for bug reporting (proc 0): config=RELEASE=2024.5.0,SPEC=1.20,PTR=64bit,nodebug,PAR,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native compiler=CLANG/19.1.3  sys=x86_64-pc-linux-gnu
[0] Invoking GDB for backtrace...
[0] /usr/bin/gdb -nx -batch -x /tmp/gasnet_5UUg0o '/freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real' 10146
[0] [New LWP 10147]
[0] [New LWP 10148]
[0] [New LWP 10149]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/lib64/libthread_db.so.1".
[0] 0x000014578444d76f in wait4 () from /lib64/libc.so.6
[0] To enable execution of this file add
[0] 	add-auto-load-safe-path /freya/u/system/soft/SLE_15/packages/x86_64/gcc/11.2.0/lib64/libstdc++.so.6.0.29-gdb.py
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] To completely disable this security protection add
[0] 	set auto-load safe-path /
[0] line to your configuration file "/u/adutt/.config/gdb/gdbinit".
[0] For more information about this security protection see the
[0] "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
[0] 	info "(gdb)Auto-loading safe path"
[0]   Id   Target Id                                           Frame 
[0] * 1    Thread 0x1457862f5200 (LWP 10146) "test-locales_re" 0x000014578444d76f in wait4 () from /lib64/libc.so.6
[0]   2    Thread 0x145780bdd700 (LWP 10147) "test-locales_re" 0x00001457844771e9 in poll () from /lib64/libc.so.6
[0]   3    Thread 0x145779d77700 (LWP 10148) "test-locales_re" 0x0000145784483e1f in epoll_wait () from /lib64/libc.so.6
[0]   4    Thread 0x14576fbf1700 (LWP 10149) "test-locales_re" 0x00001457844771e9 in poll () from /lib64/libc.so.6
[0] 
[0] Thread 4 (Thread 0x14576fbf1700 (LWP 10149) "test-locales_re"):
[0] #0  0x00001457844771e9 in poll () from /lib64/libc.so.6
[0] #1  0x000014578380e7e5 in ips_ptl_pollintr (rcvthreadc=0x14576fbf0d80) at /home/scm/gitrepo/ifs-all/components/psm/temp.build/BUILD/libpsm2-11.2.228/ptl_ips/ptl_rcvthread.c:379
[0] #2  0x00001457853fc6ea in start_thread () from /lib64/libpthread.so.0
[0] #3  0x0000145784483a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 3 (Thread 0x145779d77700 (LWP 10148) "test-locales_re"):
[0] #0  0x0000145784483e1f in epoll_wait () from /lib64/libc.so.6
[0] #1  0x00001457824385e7 in epoll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014578243b519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x00001457807690fe in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/openmpi/mca_pmix_pmix3x.so
[0] #4  0x00001457853fc6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x0000145784483a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 2 (Thread 0x145780bdd700 (LWP 10147) "test-locales_re"):
[0] #0  0x00001457844771e9 in poll () from /lib64/libc.so.6
[0] #1  0x00001457824434ad in poll_dispatch () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #2  0x000014578243b519 in opal_libevent2022_event_base_loop () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #3  0x00001457823ff24e in progress_engine () from /mpcdf/soft/SLE_15/packages/skylake/openmpi/gcc_10-10.3.0/4.0.7/lib/libopen-pal.so.40
[0] #4  0x00001457853fc6ea in start_thread () from /lib64/libpthread.so.0
[0] #5  0x0000145784483a8f in clone () from /lib64/libc.so.6
[0] 
[0] Thread 1 (Thread 0x1457862f5200 (LWP 10146) "test-locales_re"):
[0] #0  0x000014578444d76f in wait4 () from /lib64/libc.so.6
[0] #1  0x00001457843c4bc7 in do_system () from /lib64/libc.so.6
[0] #2  0x0000000000524b7a in gasneti_system_redirected ()
[0] #3  0x0000000000524520 in gasneti_bt_gdb ()
[0] #4  0x000000000051e36f in gasneti_print_backtrace ()
[0] #5  0x000000000040725b in gasneti_error_abort ()
[0] #6  0x0000000000406d5c in _gasneti_fatalerror ()
[0] #7  0x0000000000512a9c in gasnetc_ofi_init ()
[0] #8  0x00000000005051d6 in gex_Client_Init_GASNET_202450PARnopshmEVERYTHINGnodebugnotracenostatsnodebugmallocnosrclines ()
[0] #9  0x000000000046bd5e in chpl_comm_init ()
[0] #10 0x00000000004656b4 in chpl_rt_init ()
[0] #11 0x000000000045b726 in main ()
[0] [Inferior 1 (process 10146) detached]
[freyag01:10146] *** Process received signal ***
[freyag01:10146] Signal: Aborted (6)
[freyag01:10146] Signal code:  (-2045808980)
[freyag01:10146] [ 0] /lib64/libc.so.6(+0x4ad70)[0x1457843b6d70]
[freyag01:10146] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x1457843b6cdb]
[freyag01:10146] [ 2] /lib64/libc.so.6(abort+0x177)[0x1457843b8375]
[freyag01:10146] [ 3] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407291]
[freyag01:10146] [ 4] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x406d5c]
[freyag01:10146] [ 5] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x512a9c]
[freyag01:10146] [ 6] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x5051d6]
[freyag01:10146] [ 7] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x46bd5e]
[freyag01:10146] [ 8] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x4656b4]
[freyag01:10146] [ 9] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x45b726]
[freyag01:10146] [10] /lib64/libc.so.6(__libc_start_main+0xef)[0x1457843a12bd]
[freyag01:10146] [11] /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real[0x407e7a]
[freyag01:10146] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 10146 on node freyag01 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
srun: error: freyag01: task 0: Exited with exit code 134
srun: launch/slurm: _step_signal: Terminating StepId=750491.0
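
Both runs above die at the same point, and the one line that matters is easy to lose in the GDB output. As a minimal sketch (the function names fatal_lines and fatal_code are hypothetical helpers, not part of Chapel or GASNet), the fatal error and its return code could be pulled out of a captured log like this:

```shell
#!/bin/bash
# Hypothetical helpers (not part of Chapel or GASNet) for trimming a noisy
# run log down to the GASNet fatal error it contains.

# Print each distinct "*** FATAL ERROR" line read from stdin.
fatal_lines() {
  grep -F '*** FATAL ERROR' | sort -u
}

# Print the numeric code from the first "failed: -22(Invalid argument)"
# style message read from stdin.
fatal_code() {
  sed -n 's/.*failed: \(-\{0,1\}[0-9][0-9]*\)(.*/\1/p' | head -n 1
}
```

For example, `./test-locales -nl 1 2>&1 | fatal_lines` should reduce each of the runs above to the single `fi_endpoint for rdma failed: -22(Invalid argument)` line. The code -22 is the negated errno EINVAL, which matches the "Invalid argument" text in the message.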

Chapel config used

adutt@freya01:chapel-multi_locale$ export CHPL_HOME=/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0
adutt@freya01:chapel-multi_locale$ export CHPL_COMM=gasnet
adutt@freya01:chapel-multi_locale$ export CHPL_COMM_SUBSTRATE=ofi
adutt@freya01:chapel-multi_locale$ export FI_PROVIDER=psm2
adutt@freya01:chapel-multi_locale$ export CHPL_LAUNCHER=gasnetrun_ofi
adutt@freya01:chapel-multi_locale$ export GASNET_OFI_SPAWNER=mpi
adutt@freya01:chapel-multi_locale$ export HFI_NO_CPUAFFINITY=1
adutt@freya01:chapel-multi_locale$ export CHPL_LLVM=bundled
adutt@freya01:chapel-multi_locale$ export CHPL_TARGET_CPU=skylake
adutt@freya01:chapel-multi_locale$ export OMPI_MCA_btl=tcp,self
adutt@freya01:chapel-multi_locale$ export GASNET_BACKTRACE=1
adutt@freya01:chapel-multi_locale$ export OMPI_MCA_btl=tcp,self
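
Since the settings above were typed by hand, it may help to dump what the shell actually exports just before launching. A minimal sketch, assuming the prefixes below cover every variable relevant to this setup (show_launch_env is a hypothetical helper name):

```shell
#!/bin/bash
# Hypothetical helper: list the launch-related variables exported in the
# current shell, sorted, so a stale or misspelled value stands out.
show_launch_env() {
  env | grep -E '^(CHPL_|GASNET_|FI_|HFI_|OMPI_MCA_)' | sort
}

show_launch_env
```

Running this inside the salloc or srun session, rather than on the login node, shows the environment the program really sees. For the CHPL_ settings alone, $CHPL_HOME/util/printchplenv --all gives Chapel's own view of the configuration.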

Hi @dutta-alankar —

Well shoot, I was optimistic that would resolve things. And I need to apologize: I completely missed the final sentence in your previous response indicating you'd already tried this:

Further, using export OMPI_MCA_btl=tcp,self doesn't fix the problem with mpi.

I think from here we should return to getting that debug build of GASNet going. Let me get you better instructions than I did last time.

-Brad