Chapel not working for OmniPath slurm

@dutta-alankar

I got pulled away to other things before finishing looking into the --enable-debug build earlier.

Tonight, I'm finding that if I set export CHPL_GASNET_CFG_OPTIONS=--enable-debug and then build GASNet as described earlier (with no changes to the Makefile), I don't hit the error you reported above. Does taking that approach work any better for you?

If not, I'll need to check with the GASNet team to see why you're hitting that error condition when I'm not. Checking the line in question, it seems to get hit if either __OPTIMIZE__ or NDEBUG are set, but I'm not sure what would be setting those.
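
For what it's worth, you can ask the compiler directly which predefined macros it sets at a given optimization level; both gcc and clang support -dM -E for this. A minimal sketch using the system cc (substitute the compiler your Chapel runtime build actually uses):

```shell
# Dump predefined macros and look for the two that trip GASNet's check.
# __OPTIMIZE__ is defined by the compiler itself at -O1 and above;
# NDEBUG is normally only defined when something passes -DNDEBUG explicitly.
echo | cc -O2 -dM -E - | grep -E '__OPTIMIZE__|NDEBUG'
echo | cc -O0 -dM -E - | grep -E '__OPTIMIZE__|NDEBUG' || echo "neither defined at -O0"
```

If either macro shows up at the optimization level your runtime build is using, that would explain why gasnetex.h's check fires for you.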

-Brad

No luck. I ran into the same error. I tried setting and unsetting CHPL_TARGET_CPU but it doesn't help.

***** src/mem/jemalloc/ *****
***** src/tasks/qthreads/ *****
***** src/qio/ *****
In file included from comm-gasnet-ex.c:23:
In file included from /freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-unknown-llvm-none/substrate-ofi/seg-everything/include/gasnet.h:11:
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-unknown-llvm-none/substrate-ofi/seg-everything/include/gasnetex.h:1135:6: error: Tried to compile GASNet client code with optimization enabled but also GASNET_DEBUG (which seriously hurts performance). Reconfigure/rebuild GASNet without --enable-debug
 1135 |     #error Tried to compile GASNet client code with optimization enabled but also GASNET_DEBUG (which seriously hurts performance). Reconfigure/rebuild GASNet without --enable-debug
      |      ^
***** src/qio/regex/bundled/ *****
1 error generated.
make[6]: *** [Makefile.share:49: ../../../../build/runtime/linux64/llvm/x86_64/cpu-unknown/loc-flat/comm-gasnet/ofi/everything/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/pci-enable/re2-bundled/fs-none/lib_pic-none/san-none/src/comm/gasnet/comm-gasnet-ex.o] Error 1
make[5]: *** [../../make/Makefile.runtime.foot:29: gasnet.makedir] Error 2
make[4]: *** [../make/Makefile.runtime.foot:29: comm.makedir] Error 2
make[3]: *** [make/Makefile.runtime.foot:29: src.makedir] Error 2
make[2]: *** [Makefile:49: all.helpme] Error 2
make[1]: *** [Makefile:107: runtime] Error 2
make: *** [Makefile:70: comprt] Error 2

[note: I edited this post after-the-fact to fix a typo and remove unnecessary quoting—a bad habit I picked up early in my career]

@dutta-alankar : I haven't been able to determine why our experiences are differing, but would you try the following?

cd $CHPL_HOME/runtime
make clean
make DEBUG=1 OPTIMIZE=0   # was, incorrectly, quoted as: make "DEBUG=1 OPTIMIZE=0"

If that works, recompile and re-run your program; hopefully we'll get more output from GASNet about why that call is failing.

One other thing that feels very paranoid, but I want to make sure of: I think we've been assuming you're using OpenMPI due to some of the paths and variables that show up in your output, but is there any chance that the mpirun and/or MPI you're using is not OpenMPI?

Thanks,
-Brad

Yes, the MPI I'm using is OpenMPI. To ensure this, all I'm doing is loading Freya's environment modules: module purge && module load gcc/11 openmpi/4.1 cmake/3.28 doxygen/1.10.0

Besides this, I followed your suggestion by going to the runtime directory and running make -j40 clean && make -j40 "DEBUG=1 OPTIMIZE=0", but I ran into the same error that stalls the compilation.

In file included from /freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/runtime/../third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/include/gasnet.h:11:
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/runtime/../third-party/gasnet/install/linux64-x86_64-skylake-llvm-none/substrate-ofi/seg-everything/include/gasnetex.h:1135:6: error: Tried to compile GASNet client code with optimization enabled but also GASNET_DEBUG (which seriously hurts performance). Reconfigure/rebuild GASNet without --enable-debug
 1135 |     #error Tried to compile GASNet client code with optimization enabled but also GASNET_DEBUG (which seriously hurts performance). Reconfigure/rebuild GASNet without --enable-debug
      |      ^
1 error generated.

I also tried make -j40 "NDEBUG=1 __OPTIMIZE__=0", but to no avail.
Before running make, I had the following set:

export CHPL_HOME=/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ofi
export FI_PROVIDER=psm2
export CHPL_LAUNCHER=gasnetrun_ofi
export GASNET_OFI_SPAWNER=mpi
export HFI_NO_CPUAFFINITY=1
export CHPL_LLVM=bundled
export CHPL_TARGET_CPU=skylake
export OMPI_MCA_btl=tcp,self
export GASNET_BACKTRACE=1
export CHPL_GASNET_CFG_OPTIONS=--enable-debug

I have also tried unsetting CHPL_TARGET_CPU, but that doesn't work either.

Another way to get a debug build of the runtime is to just set CHPL_COMM_DEBUG:

export CHPL_COMM_DEBUG=1
cd $CHPL_HOME/runtime
make clobber
make

I'll also note that I think there is a typo in Chapel not working for OmniPath slurm - #23 by bradcray. It should be make DEBUG=1 OPTIMIZE=0, without the quotes. Regardless, CHPL_COMM_DEBUG by itself should do the trick.

-Jade

1 Like

Thanks! This was very useful and now the compilation succeeded.

Here is what I now get when I try to run the code:

adutt@freya01:chapel-multi_locale$ $CHPL_HOME/bin/linux64-x86_64/chpl test-locales.chpl --fast -o test-locales
adutt@freya01:chapel-multi_locale$ salloc -N 1 -p p.test
salloc: Granted job allocation 750670
salloc: Waiting for resource configuration
salloc: Nodes freyag01 are ready for job
adutt@freya01:chapel-multi_locale$ srun ./test-locales --verbose
srun: spank: option "enable-coredump" provided by both coredumpsize.so and coredumpsize.so
error: Specify number of locales via -nl <#> or --numLocales=<#>
srun: error: freyag01: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=750670.0
adutt@freya01:chapel-multi_locale$ srun ./test-locales -nl 1 --verbose 
srun: spank: option "enable-coredump" provided by both coredumpsize.so and coredumpsize.so
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none-debug/substrate-ofi/seg-everything/bin/gasnetrun_ofi -n 1 -N 1 -c 0 -E SLURM_MPI_TYPE,CONDA_SHLVL,LC_ALL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,XKEYSYMDB,GASNET_BACKTRACE,CHPL_COMM_DEBUG,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HOSTNAME,OLDPWD,CHPL_TARGET_CPU,__MODULES_SHARE_MODULEPATH,CSHEDIT,GPG_TTY,LESS_ADVANCED_PREPROCESSOR,OPENMPI_HOME,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,CHPL_LAUNCHER,MACHTYPE,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,LC_CTYPE,SLURM_NODELIST,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,_CE_CONDA,LIBGL_DEBUG,__MODULES_LMALTNAME,SLURM_JOB_NAME,GCC_HOME,PROFILEREAD,LIBRARY_PATH,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SSH_TTY,FROM_HEADER,MAIL,SLURM_JOB_NUM_NODES,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,CHPL_LLVM,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,XDG_RUNTIME_DIR,SYS,CHPL_GASNET_CFG_OPTIONS,XDG_CONFIG_DIRS,PATH,SLURM_JOB_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,G_BROKEN_FILENAMES,HISTSIZE,CPU,SSH_SENDS_LOCALE,DOXYGEN_HOME,CVS_RSH,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_,SLURM_PRIO_PROCESS,SRUN_DEBUG,SLURM_UMASK,SLURM_JOB_CPUS_PER_NODE_PACK_GROUP_0,SLURM_NTASKS,SLURM_NPROCS,SLURM_DISTRIBUTION,SLURM_STEP_ID,SLURM_STEPID,SLURM_SRUN_
COMM_PORT,SLURM_JOB_UID,SLURM_JOB_USER,SLURM_WORKING_CLUSTER,SLURM_STEP_NODELIST,SLURM_STEP_NUM_NODES,SLURM_STEP_NUM_TASKS,SLURM_STEP_TASKS_PER_NODE,SLURM_STEP_LAUNCHER_PORT,SLURM_SRUN_COMM_HOST,SLURM_TOPOLOGY_ADDR,SLURM_TOPOLOGY_ADDR_PATTERN,SLURM_CPUS_ON_NODE,SLURM_CPU_BIND,SLURM_CPU_BIND_LIST,SLURM_CPU_BIND_TYPE,SLURM_CPU_BIND_VERBOSE,SLURM_TASK_PID,SLURM_NODEID,SLURM_PROCID,SLURM_LOCALID,SLURM_LAUNCH_NODE_IPADDR,SLURM_GTIDS,SLURM_JOB_GID,SLURMD_NODENAME,PMI_FD,PMI_JOBID,PMI_RANK,PMI_SIZE,JOB_TMPDIR,JOB_SHMTMPDIR,TMPDIR, /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 1 --verbose
Spawner is set to MPI, but MPI support was not compiled in
usage: gasnetrun -n <n> [options] [--] prog [program args]
    options:
      -n <n>                 number of processes to run
      -N <N>                 number of nodes to run on (not always supported)
      -c <n>                 number of cpus per process (not always supported)
      -E <VAR1[,VAR2...]>    list of environment vars to propagate
      -v                     enable verbose output, repeated use increases verbosity
      -t                     test only, don't execute anything (implies -v)
      -k                     keep any temporary files created (implies -v)
      -spawner=(ssh|mpi|pmi) force use of a specific spawner
      --                     ends option parsing
srun: error: freyag01: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=750670.1

Looks like GASNet isn't getting compiled with MPI support. What flags should I set to fix this?

Finally success!
I had to set CHPL_GASNET_CFG_OPTIONS=--with-mpi-cc=mpicc, which I found by looking at the GASNet Makefile. Now this is what I get:

adutt@freya01:chapel-multi_locale$ ./test-locales -nl 4
(freyag01, 40)
(freyag02, 40)
(freyag03, 40)
(freyag04, 40)

Here are my environment variables:

export CHPL_HOME="/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0"
export CHPL_COMM="gasnet"
export CHPL_COMM_SUBSTRATE="ofi"
export FI_PROVIDER="psm2"
export CHPL_LAUNCHER="slurm-gasnetrun_ofi"
export GASNET_OFI_SPAWNER="mpi"
export HFI_NO_CPUAFFINITY=1
export CHPL_LLVM="bundled"
export CHPL_TARGET_CPU="skylake"
export GASNET_BACKTRACE=1
export OMPI_MCA_btl="tcp,self"
export CHPL_COMM_DEBUG=1
export CHPL_GASNET_CFG_OPTIONS="--with-mpi-cc=mpicc --enable-debug"

Presently, this is not part of a Slurm script, and I do not submit any job interactively either. Chapel does the job submission when I run the code in the terminal because I have CHPL_LAUNCHER_PARTITION=p.test and CHPL_LAUNCHER_NODE_ACCESS=exclusive set.
How can I use this as part of a Slurm script so that the code can run in the background and dump to a file? Right now, if I put it in a Slurm script, it spawns additional nodes and submits a new job on a separate set of nodes. Changing slurm-gasnetrun_ofi to gasnetrun_ofi does not change this.

1 Like

This behavior got fixed, and I can now submit with a Slurm script after recompiling Chapel with the following settings. Now it doesn't launch additional jobs.

export CHPL_HOME="/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0"
export CHPL_COMM="gasnet"
export CHPL_COMM_SUBSTRATE="ofi"
export FI_PROVIDER="psm2"
export CHPL_LAUNCHER="slurm-gasnetrun_ofi"
export GASNET_OFI_SPAWNER="mpi"
export HFI_NO_CPUAFFINITY=1
export CHPL_LLVM="bundled"
export CHPL_TARGET_CPU="skylake"
export OMPI_MCA_btl="tcp,self"
export CHPL_GASNET_CFG_OPTIONS="--with-mpi-cc=mpicc"

Following is my job script:

#!/bin/bash
#SBATCH -t 0:10:0
#SBATCH --nodes=4
#SBATCH --exclusive
#SBATCH --partition=p.test
#SBATCH --output=output.chapel

module purge && module load gcc/11 openmpi/4.1 cmake/3.28 doxygen/1.10.0

export CHPL_HOME="/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0"
export CHPL_COMM="gasnet"
export CHPL_COMM_SUBSTRATE="ofi"
export FI_PROVIDER="psm2"
export CHPL_LAUNCHER="slurm-gasnetrun_ofi"
export GASNET_OFI_SPAWNER="mpi"
export HFI_NO_CPUAFFINITY=1
export CHPL_LLVM="bundled"
export CHPL_TARGET_CPU="skylake"
export OMPI_MCA_btl="tcp,self"
export CHPL_GASNET_CFG_OPTIONS="--with-mpi-cc=mpicc"

$CHPL_HOME/bin/linux64-x86_64/chpl test-locales.chpl --fast -o test-locales

$CHPL_HOME/util/chplenv/printchplbuilds.py

# Set the Chapel program and dynamic number of locales
export PROG="./test-locales"
export ARGS="-nl $SLURM_NNODES" # --verbose"  # Dynamically set the number of locales

# Run the Chapel program using srun
echo "Running Chapel program with $SLURM_NNODES locales..."
echo $CHPL_HOME
echo $CHPL_LAUNCHER
echo $GASNET_OFI_SPAWNER
$PROG $ARGS

In the output dump, I get a warning/error (error: The runtime has not been built for this configuration. Run $CHPL_HOME/util/chplenv/printchplbuilds.py for information on available runtimes.), but everything seems to be okay!
Any tips will be absolutely helpful, and I appreciate all the help from this Discourse forum, without which it wouldn't have been feasible.

error: The runtime has not been built for this configuration. Run $CHPL_HOME/util/chplenv/printchplbuilds.py for information on available runtimes.
Running Chapel program with 4 locales...
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0
slurm-gasnetrun_ofi
mpi
(freyag01, 40)
(freyag02, 40)
(freyag03, 40)
(freyag04, 40)

Output with verbose on

error: The runtime has not been built for this configuration. Run $CHPL_HOME/util/chplenv/printchplbuilds.py for information on available runtimes.
                           <Current>              0                 
     CHPL_TARGET_PLATFORM: linux64              linux64             
     CHPL_TARGET_COMPILER: llvm                 llvm                
         CHPL_TARGET_ARCH: x86_64               x86_64              
          CHPL_TARGET_CPU: skylake              skylake             
        CHPL_LOCALE_MODEL: flat                 flat                
                CHPL_COMM: gasnet               gasnet              
          CHPL_COMM_DEBUG: -                    +*                  
      CHPL_COMM_SUBSTRATE: ofi                  ofi                 
      CHPL_GASNET_SEGMENT: everything           everything          
               CHPL_TASKS: qthreads             qthreads            
         CHPL_TASKS_DEBUG: -                    -                   
              CHPL_TIMERS: generic              generic             
              CHPL_UNWIND: none                 none                
                 CHPL_MEM: jemalloc             jemalloc            
             CHPL_ATOMICS: cstdlib              cstdlib             
               CHPL_HWLOC: bundled              bundled             
         CHPL_HWLOC_DEBUG: -                    -                   
           CHPL_HWLOC_PCI: enable               enable              
                 CHPL_RE2: bundled              bundled             
         CHPL_AUX_FILESYS: none                 none                
             CHPL_LIB_PIC: none                 none                
        CHPL_SANITIZE_EXE: none                 none                
                    MTIME: NA                   Feb 06 15:13        
Running Chapel program with 4 locales...
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0
slurm-gasnetrun_ofi
mpi
/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0/third-party/gasnet/install/linux64-x86_64-skylake-llvm-none-debug/substrate-ofi/seg-everything/bin/gasnetrun_ofi -n 4 -N 4 -c 0 -E SLURM_MPI_TYPE,CONDA_SHLVL,LS_COLORS,LD_LIBRARY_PATH,CONDA_EXE,HOSTTYPE,SLURM_NODEID,SLURM_TASK_PID,SSH_CONNECTION,SPACK_PYTHON,LESSCLOSE,SLURM_PRIO_PROCESS,XKEYSYMDB,LANG,SLURM_SUBMIT_DIR,WINDOWMANAGER,LESS,OMPI_MCA_io,HOSTNAME,CHPL_TARGET_CPU,OLDPWD,__MODULES_SHARE_MODULEPATH,CSHEDIT,ENVIRONMENT,PROG,GPG_TTY,OPENMPI_HOME,LESS_ADVANCED_PREPROCESSOR,GASNET_OFI_SPAWNER,MPI_PATH,COLORTERM,CHOLLA_DIR,SLURM_CELL,ROCR_VISIBLE_DEVICES,SLURM_PROCID,CHPL_LAUNCHER,SLURM_JOB_GID,MACHTYPE,SLURMD_NODENAME,JOB_TMPDIR,MINICOM,SLURM_TASKS_PER_NODE,_CE_M,QT_SYSTEM_DIR,OSTYPE,XDG_SESSION_ID,MODULES_CMD,HFI_NO_CPUAFFINITY,SLURM_NNODES,USER,PAGER,DOMAIN,PLUTO_DIR,MORE,CHPL_COMM_SUBSTRATE,PWD,SLURM_JOB_NODELIST,HOME,SLURM_CLUSTER_NAME,CONDA_PYTHON_EXE,SLURM_NODELIST,SLURM_GPUS_ON_NODE,HOST,SSH_CLIENT,CHPL_COMM,XNLSPATH,CPATH,XDG_SESSION_TYPE,KRB5CCNAME,SLURM_JOB_CPUS_PER_NODE,INTERACTIVE,XDG_DATA_DIRS,MPCDF_SUBMODULE_COMBINATIONS,SLURM_TOPOLOGY_ADDR,_CE_CONDA,LIBGL_DEBUG,SLURM_WORKING_CLUSTER,__MODULES_LMALTNAME,GCC_HOME,SLURM_JOB_NAME,PROFILEREAD,TMPDIR,LIBRARY_PATH,SLURM_JOB_GPUS,SLURM_JOBID,SLURM_CONF,LOADEDMODULES,FI_PROVIDER,SLURM_NODE_ALIASES,SLURM_JOB_QOS,SLURM_TOPOLOGY_ADDR_PATTERN,SSH_TTY,FROM_HEADER,MAIL,SLURM_CPUS_ON_NODE,SLURM_JOB_NUM_NODES,SLURM_MEM_PER_NODE,LESSKEY,SPACK_ROOT,SHELL,TERM,XDG_SESSION_CLASS,CMAKE_HOME,SLURM_JOB_UID,ARGS,__MODULES_LMCONFLICT,XCURSOR_THEME,LS_OPTIONS,SLURM_JOB_PARTITION,SLURM_JOB_USER,CUDA_VISIBLE_DEVICES,CHPL_LLVM,SHLVL,SLURM_SUBMIT_HOST,G_FILENAME_ENCODING,SLURM_JOB_ACCOUNT,MANPATH,AFS,CELL,MODULEPATH,CHPL_HOME,SLURM_GTIDS,LOGNAME,DBUS_SESSION_BUS_ADDRESS,CLUSTER,XDG_RUNTIME_DIR,SYS,CHPL_GASNET_CFG_OPTIONS,XDG_CONFIG_DIRS,PATH,SLURM_JOB_ID,_LMFILES_,MODULESHOME,PKG_CONFIG_PATH,INFOPATH,JOB_SHMTMPDIR,G_BROKEN_FILENAMES,HISTSIZE,CPU,DOXYGEN_HOME,SLURM_LOCALID,C
VS_RSH,GPU_DEVICE_ORDINAL,LESSOPEN,OMPI_MCA_btl,BASH_FUNC_module%%,BASH_FUNC_spack%%,BASH_FUNC__module_raw%%,BASH_FUNC__spack_shell_wrapper%%,BASH_FUNC_mc%%,BASH_FUNC_ml%%,_, /freya/ptmp/mpa/adutt/chapel-multi_locale/test-locales_real -nl 4 --verbose
0: using core(s) 0-39
oversubscribed = False
1: using core(s) 0-39
2: using core(s) 0-39
3: using core(s) 0-39
QTHREADS: Using 40 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8384512 byte stack size.
QTHREADS: Using 40 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8384512 byte stack size.
QTHREADS: Using 40 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 40 Shepherds
QTHREADS: Using 1 Workers per Shepherd
QTHREADS: Using 8384512 byte stack size.
QTHREADS: Using 8384512 byte stack size.
comm task bound to accessible PUs
PSHM is disabled.
executing locale 0 of 4 on node 'freyag01'
0: enter barrier for 'barrier before main'
executing locale 3 of 4 on node 'freyag04'
3: enter barrier for 'barrier before main'
executing locale 2 of 4 on node 'freyag03'
2: enter barrier for 'barrier before main'
executing locale 1 of 4 on node 'freyag02'
1: enter barrier for 'barrier before main'
3: enter barrier for 'fill node 0 globals buf'
0: enter barrier for 'fill node 0 globals buf'
2: enter barrier for 'fill node 0 globals buf'
1: enter barrier for 'fill node 0 globals buf'
0: enter barrier for 'broadcast global vars'
2: enter barrier for 'broadcast global vars'
3: enter barrier for 'broadcast global vars'
1: enter barrier for 'broadcast global vars'
1: enter barrier for 'pre-user-code hook: init done'
2: enter barrier for 'pre-user-code hook: init done'
3: enter barrier for 'pre-user-code hook: init done'
0: enter barrier for 'pre-user-code hook: init done'
0: enter barrier for 'pre-user-code hook: task counts stable'
1: enter barrier for 'pre-user-code hook: task counts stable'
3: enter barrier for 'pre-user-code hook: task counts stable'
0: enter barrier for 'pre-user-code hook: mem tracking inited'
2: enter barrier for 'pre-user-code hook: task counts stable'
1: enter barrier for 'pre-user-code hook: mem tracking inited'
3: enter barrier for 'pre-user-code hook: mem tracking inited'
(freyag01, 40)
2: enter barrier for 'pre-user-code hook: mem tracking inited'
(freyag02, 40)
(freyag03, 40)
(freyag04, 40)
0: enter barrier for 'stop polling'
1: enter barrier for 'stop polling'
2: enter barrier for 'stop polling'
3: enter barrier for 'stop polling'

This behavior got fixed, and I can now submit with a Slurm script after recompiling Chapel with the following settings. Now it doesn't launch additional jobs.

Looking at your script, it's not clear to me what you changed. I would have expected it to have export CHPL_LAUNCHER="gasnetrun_ofi".

The error about the runtime not being built means you are not running what you think. The warning means you need to rebuild the runtime because some of the Chapel env variables have changed. So the launch script is failing to compile, and then launching whatever was previously compiled (because the file still exists). Looking at your printchplbuilds output, it looks like you need to rebuild Chapel now that CHPL_COMM_DEBUG is unset. Or just make sure that it is set in your job script (but note that you will then be running a debug build of the runtime with decreased performance).
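
A minimal sketch of that rebuild, following the same commands used earlier in this thread (pick one of the two branches depending on whether you want the debug runtime):

```shell
# Either make the runtime match the current environment...
unset CHPL_COMM_DEBUG
# ...or keep the debug runtime and also set this in the job script:
# export CHPL_COMM_DEBUG=1

cd $CHPL_HOME/runtime
make clobber
make

# Verify: the <Current> column should now match a built runtime
$CHPL_HOME/util/chplenv/printchplbuilds.py
```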

@jabraham Thanks! I see. That's why I'm unable to reproduce this after a make clobber.
No, the following issue still persists.

I am not sure what's going on then. If you set export CHPL_LAUNCHER=gasnetrun_ofi, rebuild the runtime, and then rebuild your executable, the launcher should not be invoking Slurm anymore.

You said you fixed it? What did you do? Either way, if you are still having issues, I am more than happy to get on a call to try to resolve this live. If you have access to the Chapel Discord or Gitter, those would be good places to set up a call. See https://chapel-lang.org/community/ for how to access them.

-Jade

1 Like

I'm glad you were able to get things working! I'm similarly unclear on what fixed things. I wouldn't expect CHPL_COMM_DEBUG to turn a non-working build and launch into a working one, and was expecting it to just give us additional information about why the launch wasn't working. Note that when you care about performance, you'll definitely want to build and run without CHPL_COMM_DEBUG because with all the debugging and checks enabled and optimizations turned off, performance will be negatively affected (possibly dramatically).

With respect to this question:

If you want Chapel to take care of the slurm commands for you (which can sometimes be helpful to ensure the right number of cores and processes are used per node), I'd expect you to be able to do this using normal shell commands, such as the following in bash:

$ ./myChapelProgram -nl 4 > myChapelProgram.out 2>&1 &

That said, I don't work with slurm very often, so I could be mistaken. Are you finding that techniques like this don't work?

Thanks,
-Brad

This is what fixed it.
With debug enabled on GASNet, I could see that it wasn't compiled with MPI support. Then I looked at the GASNet Makefile and added this, which fixed the issue.
Of course, this was also needed.
export OMPI_MCA_btl="tcp,self"

1 Like

@jabraham Thanks a lot! Let me try this and get back to you.

I think I was able to fix this. I wasn't aware that Chapel needs to be compiled with different flags set for CHPL_LAUNCHER and thought that it was only a runtime environment variable. This was causing a mixup, and I was running a program compiled with one Chapel flavor under another.

Now I have four separate Chapel flavors compiled: one production and one debug, each with CHPL_LAUNCHER set either to gasnetrun_ofi or to slurm-gasnetrun_ofi (using the usual procedure of passing different directories to configure --prefix=<dir> and running make install after compilation).

Submission through a Slurm job script is now possible with gasnetrun_ofi in both production and debug variants, while slurm-gasnetrun_ofi allows Chapel to submit to Slurm directly, provided CHPL_LAUNCHER_PARTITION (and perhaps CHPL_LAUNCHER_NODE_ACCESS; I didn't test without it) is set. For ease of use, I have also put in scripts that can simply be sourced to set the environment variables depending on the Chapel flavor. Also, I think I needed to recompile the code after switching to a different Chapel flavor.

I'm now pretty happy with how it stands and can proceed to write real codes and run on the cluster. Thanks a lot to @bradcray @jabraham for all the time and effort you have put to look into it. I appreciate your help without which it wouldn't have been possible.

1 Like

I'm glad you were able to get it to work. Note that you don't need to use different directories for different Chapel configurations, they can all co-exist in the same directory. If you try to compile a Chapel program for which a corresponding runtime does not exist, you'll get an error message that Chapel wasn't built for that configuration and a suggestion to run printchplbuilds.py. Also note that Chapel variables that are applied at runtime generally start with "CHPL_RT_", whereas variables that start with only "CHPL_" require rebuilding Chapel.
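
To make the naming convention concrete, a small sketch (CHPL_RT_NUM_THREADS_PER_LOCALE is a real runtime variable; the values are just illustrative):

```shell
# Build-time: changing these requires rebuilding the Chapel runtime
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ofi

# Run-time: CHPL_RT_-prefixed variables are read when the program starts,
# so no rebuild is needed
export CHPL_RT_NUM_THREADS_PER_LOCALE=40
```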

John

2 Likes

Glad everything works for you now.

Since neither I (of the GASNet team) nor the Chapel team understand why --with-mpi-cc=mpicc would resolve your original problem, I want to offer an alternative response to the message "Spawner is set to MPI, but MPI support was not compiled in" (for anyone reading this in hopes of resolving a similar issue).

I suggest export CHPL_GASNET_CFG_OPTIONS=--enable-mpi-compat (in addition to any other desired options like --enable-debug). This should cause GASNet's configure step to fail if MPI spawner support cannot be compiled, with earlier output including information regarding why the support could not be compiled in.
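
Concretely, combining that with the rebuild steps used earlier in this thread would look something like the following sketch:

```shell
# --enable-mpi-compat makes GASNet's configure fail loudly if the MPI
# spawner can't be built, instead of silently omitting it;
# add --enable-debug and other options as desired.
export CHPL_GASNET_CFG_OPTIONS="--enable-mpi-compat"

cd $CHPL_HOME/runtime
make clobber
make
```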

2 Likes

@dutta-alankar — A belated observation here that didn't occur to me until I was helping another user last week. Did you ever try using GASNet's pmi option for launching rather than mpi and ssh? Since you're using slurm, it seems like there's reason to believe it might "just work" while also reducing the potential overheads and extra dependency of using MPI. If you have the chance to try this, I'd be curious what your results would be (again, our limited access to Omnipath systems means we learn most of what we know through users' experiences).

Thanks,
-Brad

I tested it with the following options, but it didn't work.

module purge && module load gcc/11 openmpi/4.1 cmake/3.28 doxygen/1.10.0
export CHPL_HOME=/freya/ptmp/mpa/adutt/chapel-multi_locale/chapel-2.3.0
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ofi
export FI_PROVIDER=psm2
export CHPL_LAUNCHER=gasnetrun_ofi
export GASNET_OFI_SPAWNER=pmi
export HFI_NO_CPUAFFINITY=1
export CHPL_LLVM=bundled
export CHPL_TARGET_CPU=skylake
export GASNET_BACKTRACE=1
export OMPI_MCA_btl=tcp,self
export CHPL_GASNET_CFG_OPTIONS="--enable-debug"
export CHPL_COMM_DEBUG=1

However, I have access to another cluster called Orion (Astrophysics ORION — Technical Documentation) that also doesn't allow users to ssh to compute nodes and uses an InfiniBand interconnect. There, pmi works. Here are the environment variables that I used on Orion:

export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_LAUNCHER=gasnetrun_ibv
export GASNET_IBV_SPAWNER=pmi
export HFI_NO_CPUAFFINITY=1
export CHPL_LLVM=bundled
export CHPL_TARGET_CPU=native
export CHPL_GASNET_CFG_OPTIONS="--with-mpi-cc=mpicc --enable-mpi-compat"
export GASNET_PHYSMEM_MAX='335 GB'