Fatal signals in a multi-locale run

Hi Chapel team, my program gets halted frequently without giving any "Chapel-specific" error. I tried to get more info by setting GASNET_BACKTRACE, but I only see the name of one of my functions (the bolded one). I am making sure to use sync variables for shared variables to avoid data races, and aside from the parallelization across locales I do not have any parallelism within each locale. Each locale runs some sequential steps, sends results to Locale 0 if the lock lets it, reads some results back in the same way, and continues. Any help in reading these messages would be highly appreciated.

Caught a fatal signal (proc 0): SIGSEGV(11)
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
[0] Invoking GDB for backtrace...
[0] /cvmfs/soft.computecanada.ca/gentoo/2020/usr/bin/gdb -nx -batch -x /tmp/gasnet_VNz9F9 '/project/6006050/masgari/RunningDDS/CompiledProgram4_real' 75566
[0] [New LWP 75603]
[0] [New LWP 75604]
[0] [New LWP 75605]
[0] [New LWP 75606]
[0] [New LWP 75607]
[0] [Thread debugging using libthread_db enabled]
[0] Using host libthread_db library "/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libthread_db.so.1".
[0] __lll_lock_wait (futex=futex@entry=0x7ffef56e5a60, private=0) at lowlevellock.c:52
[0] Id Target Id Frame
[0] * 1 Thread 0x7f77bdc1c8c0 (LWP 75566) "CompiledProgram" __lll_lock_wait (futex=futex@entry=0x7ffef56e5a60, private=0) at lowlevellock.c:52
[0] 2 Thread 0x7f4fd65ff700 (LWP 75603) "CompiledProgram" pthread_spin_lock () at ../sysdeps/x86_64/nptl/pthread_spin_lock.S:31
[0] 3 Thread 0x7f4fd5dfe700 (LWP 75604) "CompiledProgram" futex_abstimed_wait_cancelable (private=0, abstime=0x7f4fd5dfdc70, clockid=, expected=0, futex_word=0x7f4fd7eae1bc) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
[0] 4 Thread 0x7f4fd55fd700 (LWP 75605) "CompiledProgram" futex_abstimed_wait_cancelable (private=0, abstime=0x7f4fd55fcc70, clockid=, expected=0, futex_word=0x7f4fd7eae2bc) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
[0] 5 Thread 0x7f4fd4dfc700 (LWP 75606) "CompiledProgram" 0x00007f77bde177de in __GI___waitpid (pid=88648, stat_loc=stat_loc@entry=0x7f4fdd7f2b88, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
[0] 6 Thread 0x7f4fcffff700 (LWP 75607) "CompiledProgram" 0x00007f77bde314d7 in sched_yield () at ../sysdeps/unix/syscall-template.S:78
[0]
[0] Thread 6 (Thread 0x7f4fcffff700 (LWP 75607)):
[0] #0 0x00007f77bde314d7 in sched_yield () at ../sysdeps/unix/syscall-template.S:78
[0] #1 0x00000000008e3185 in polling ()
[0] #2 0x00000000008e106d in comm_task_wrapper ()
[0] #3 0x00007f77be3abf27 in start_thread (arg=) at pthread_create.c:479
[0] #4 0x00007f77bde4987f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0]
[0] Thread 5 (Thread 0x7f4fd4dfc700 (LWP 75606)):
[0] #0 0x00007f77bde177de in __GI___waitpid (pid=88648, stat_loc=stat_loc@entry=0x7f4fdd7f2b88, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
[0] #1 0x00007f77bdd95777 in do_system (line=) at ../sysdeps/posix/system.c:167
[0] #2 0x000000000091f22a in gasneti_bt_gdb ()
[0] #3 0x00000000009236e8 in gasneti_print_backtrace ()
[0] #4 0x0000000000410b09 in gasneti_defaultSignalHandler ()
[0] #5 <signal handler called>
[0] #6 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:238
[0] #7 0x000000000091dca0 in gasnetc_ofi_am_send_long ()
[0] #8 0x0000000000915595 in gasnetc_AMReplyLongM ()
[0] #9 0x00000000008e2c83 in AM_reply_put ()
[0] #10 0x000000000091bb87 in gasnetc_ofi_poll ()
[0] #11 0x0000000000910c09 in gasnetc_AMPoll ()
[0] #12 0x00000000008e35e2 in execute_on_common ()
[0] #13 0x00000000008e728d in chpl_comm_execute_on ()
[0] #14 0x00000000005143c7 in chpl_executeOn ()
[0] #15 0x0000000000471861 in release_chpl22 ()
[0] #16 0x0000000000474de6 in doClear_chpl5 ()
[0] #17 0x0000000000475b5d in deinit_chpl70 ()
[0] #18 0x000000000047669f in chpl__autoDestroy42 ()
[0] #19 0x00000000005b40a9 in on_fn_chpl236 ()
[0] #20 0x000000000059ae00 in _fireAllDestructors_chpl12 ()
[0] #21 0x00000000005a1d1f in on_fn_chpl237 ()
[0] #22 0x000000000059b337 in _fireAllDestructors_chpl2 ()
[0] #23 0x000000000062663a in FillMasterTablesSingleObjective_chpl ()
[0] #24 0x000000000067d460 in on_fn_chpl283 ()
[0] #25 0x000000000067d8ae in wrapon_fn_chpl283 ()
[0] #26 0x00000000008e2db6 in fork_wrapper ()
[0] #27 0x00000000008e15ff in chapel_wrapper ()
[0] #28 0x00000000009cb7c2 in qthread_wrapper ()
[0] #29 0x0000000000000000 in ?? ()
[0]
[0] Thread 4 (Thread 0x7f4fd55fd700 (LWP 75605)):
[0] #0 futex_abstimed_wait_cancelable (private=0, abstime=0x7f4fd55fcc70, clockid=, expected=0, futex_word=0x7f4fd7eae2bc) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
[0] #1 __pthread_cond_wait_common (abstime=0x7f4fd55fcc70, clockid=, mutex=0x7f4fd7eae2c0, cond=0x7f4fd7eae290) at pthread_cond_wait.c:520
[0] #2 __pthread_cond_timedwait (cond=0x7f4fd7eae290, mutex=0x7f4fd7eae2c0, abstime=0x7f4fd55fcc70) at pthread_cond_wait.c:656
[0] #3 0x00000000009cfc06 in qt_scheduler_get_thread ()
[0] #4 0x00000000009cd367 in qthread_master ()
[0] #5 0x00007f77be3abf27 in start_thread (arg=) at pthread_create.c:479
[0] #6 0x00007f77bde4987f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0]
[0] Thread 3 (Thread 0x7f4fd5dfe700 (LWP 75604)):
[0] #0 futex_abstimed_wait_cancelable (private=0, abstime=0x7f4fd5dfdc70, clockid=, expected=0, futex_word=0x7f4fd7eae1bc) at ../sysdeps/unix/sysv/linux/futex-internal.h:208
[0] #1 __pthread_cond_wait_common (abstime=0x7f4fd5dfdc70, clockid=, mutex=0x7f4fd7eae1c0, cond=0x7f4fd7eae190) at pthread_cond_wait.c:520
[0] #2 __pthread_cond_timedwait (cond=0x7f4fd7eae190, mutex=0x7f4fd7eae1c0, abstime=0x7f4fd5dfdc70) at pthread_cond_wait.c:656
[0] #3 0x00000000009cfc06 in qt_scheduler_get_thread ()
[0] #4 0x00000000009cd367 in qthread_master ()
[0] #5 0x00007f77be3abf27 in start_thread (arg=) at pthread_create.c:479
[0] #6 0x00007f77bde4987f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[0]
[0] Thread 2 (Thread 0x7f4fd65ff700 (LWP 75603)):
[0] #0 pthread_spin_lock () at ../sysdeps/x86_64/nptl/pthread_spin_lock.S:31
[0] #1 0x00007f77be42e2ae in ofi_cq_progress () from /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/libfabric/1.12.1/lib64/libfabric.so.1
[0] #2 0x00007f77be42d76e in ofi_cq_readfrom () from /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/libfabric/1.12.1/lib64/libfabric.so.1
[0] #3 0x000000000091afd1 in gasnetc_ofi_tx_poll ()
[0] #4 0x000000000091b0d8 in gasnetc_ofi_poll ()
[0] #5 0x0000000000910c09 in gasnetc_AMPoll ()
[0] #6 0x00000000008e35e2 in execute_on_common ()
[0] #7 0x00000000008e728d in chpl_comm_execute_on ()
[0] #8 0x00000000005143c7 in chpl_executeOn ()
[0] #9 0x0000000000471861 in release_chpl22 ()
[0] #10 0x0000000000474de6 in doClear_chpl5 ()
[0] #11 0x0000000000475b5d in deinit_chpl70 ()
[0] #12 0x000000000047669f in chpl__autoDestroy42 ()
[0] #13 0x00000000005b40a9 in on_fn_chpl236 ()
[0] #14 0x000000000059ae00 in _fireAllDestructors_chpl12 ()
[0] #15 0x00000000005a1d1f in on_fn_chpl237 ()
[0] #16 0x000000000059b337 in _fireAllDestructors_chpl2 ()
[0] #17 0x000000000062663a in FillMasterTablesSingleObjective_chpl ()
[0] #18 0x000000000067d460 in on_fn_chpl283 ()
[0] #19 0x000000000067d8ae in wrapon_fn_chpl283 ()
[0] #20 0x00000000008e2db6 in fork_wrapper ()
[0] #21 0x00000000008e15ff in chapel_wrapper ()
[0] #22 0x00000000009cb7c2 in qthread_wrapper ()
[0] #23 0x0000000000000000 in ?? ()
[0]
[0] Thread 1 (Thread 0x7f77bdc1c8c0 (LWP 75566)):
[0] #0 __lll_lock_wait (futex=futex@entry=0x7ffef56e5a60, private=0) at lowlevellock.c:52
[0] #1 0x00007f77be3ae883 in __GI___pthread_mutex_lock (mutex=0x7ffef56e5a60) at ../nptl/pthread_mutex_lock.c:80
[0] #2 0x00000000009c8f52 in qthread_readFF ()
[0] #3 0x00000000008e2018 in chpl_task_callMain ()
[0] #4 0x0000000000415db5 in main ()
[0] [Inferior 1 (process 75566) detached]
Spawner: read() returned 0 (EOF)
*** Caught a signal (proc 6): SIGTERM(15)
*** Caught a signal (proc 8): SIGTERM(15)
*** Caught a signal (proc 4): SIGTERM(15)
*** Caught a signal (proc 2): SIGTERM(15)
*** Caught a signal (proc 7): SIGTERM(15)
*** Caught a signal (proc 3): SIGTERM(15)
*** Caught a signal (proc 1): SIGTERM(15)
*** Caught a signal (proc 5): SIGTERM(15)

Hi Marjan —

Most of this output is fairly boilerplate, just showing threads that are waiting for more work to be scheduled for them, or blocking on a sync variable, or the like. So it's not immediately obvious to me what's going wrong (though I'm also not very good at reading GASNet backtrace output).

As a quick sanity check, is this compiled either (a) without --fast or (b) with --fast --checks? (As we've discussed before, --fast by itself turns off the execution-time checks that can catch errors before they become fatal, in exchange for better performance; so if checks are enabled / fast is disabled, an error might be reported more clearly.)

What did your output look like before enabling GASNET_DEBUG?

Whether running in the debugger or using GASNET_BACKTRACE as above, it can be helpful to enable debugging using the -g flag, and (in my opinion) you'll also get better results if you:

  • export CHPL_TARGET_COMPILER=gnu or clang or your favorite C compiler (this will require a single re-make to rebuild your runtime for that setting)
  • compile with --savec output (to save the generated C code)

This combination should cause you to see .chpl line numbers in the stack traces above, which should point to the line in FillMasterTablesSingleObjective where the error is occurring. However, it probably won't help much with most of the other stack frames, since many of them point to internal code.

Another thought: I believe you're mostly running on a true cluster, right? It's also possible to run multiple Chapel locales on a single node (say, a laptop or favorite workstation), and if the bug shows up there as well, it can be more convenient to debug. Instructions for doing this are here: Using the Portable UDP Conduit — Chapel Documentation 1.26 (where I use the GASNET_SPAWNFN=L setting typically).

If you can reproduce it on your workstation, it may be useful to run your Chapel program within gdb with debugging enabled. There are some tips on doing this here:

The slight advantage of running within gdb is that it may make it easier to sort out which thread is hitting the error and where (probably the output above does as well, but I'm just not as good at reading it? Others on the team may be better at it).

Lastly, if the above doesn't get you anywhere, and your stack traces are fairly consistent from run to run, I often find it nearly as easy to debug by just putting debug writeln()s into my code and using them to home in on where the error is occurring, similar to a binary search. I realize that sounds lame, but it can often be simpler than trying to understand what each thread is doing.
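
For example, something like the following (just a sketch; the labels and placement are made up), narrowing the region between the markers until the crash point is bracketed:

// print markers before and after a suspect region, tagged with the locale id
writeln("locale ", here.id, ": before filling master tables");
// ... suspect code ...
writeln("locale ", here.id, ": after filling master tables");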

We have a long weekend this weekend (maybe you do too?), so I may not be back online until Tuesday,
-Brad

Hi Brad,

Thank you so much for your reply. Regarding this issue, I'd also reached out to the GASNet team, and they told me -->

Stack frame 9 of thread 5 shows the SIGSEGV occurred below a call into GASNet from the function "AM_reply_put" which is in the Chapel runtime library. This is consistent with Chapel having passed a bad address to GASNet, though possibly as a result of itself having received a bad address from the application code. Since I am not a Chapel user or developer, I encourage you to use any of Chapel's support channels to get further assistance.

The answers to your questions:

  1. The compiled program was built without --fast.
  2. Without GASNET_DEBUG, I was receiving only the fatal signal along with the "Spawner: read() returned 0 (EOF)" error again. The GASNet team told me -->

That is an error message from GASNet, but not from the signal handling logic. It generally means the remote end of a communications socket was closed as a result of a process exiting. Based on the information you have provided, my best guess is that the EOF message is printed due to the unexpected exit of one or more processes which have received a fatal signal. In other words, the fatal signal has probably caused this message as a side-effect.

  3. Thank you, I'll add these variables and flags to see if I can catch the line that throws this error. I believe there is a race condition going on on Locale 0, and even though I had been using the writeln() method, I was not able to catch the line where the data race occurs.

  4. Yes, I'm using a true cluster. In fact, with 2 locales (which is essentially the same as the single-node case) I don't see the error, which assures me that there is a data race condition.

Thank you for all the suggestions. I will apply them to see if I can find the problematic line(s).
Marjan


Hi Marjan —

Let us know if you get any more data that would be helpful to understanding what's going on. I don't know whether it's possible for you to share your code and enough data for us to reproduce the problem locally, but if so, please feel free to do so.

-Brad

Thank you Brad for your suggestion!
I realized that updating/appending variables on Locale 0 from other locales is very risky and ends up in data races or deadlocks. In fact, I had a variable on Locale 0 which is a list; assume it's called "ParameterSets". This ParameterSets is updated frequently on Locale 0 and is read in a while loop on the other locales. Since the tasks are not sequential, there may be times when a read and an update happen at the same time, and if I have passed the variable to the locales with

coforall ... with (ref ParameterSets)

the program crashes. So I tried to replicate ParameterSets on all locales (using a Chapel module for that) and set a lock on Locale 0: the lock is taken while ParameterSets is being updated on Locale 0, and the other locales read that lock and, if it is not full, they don't read the replicated variables. But the replication takes forever to complete once more memory is being consumed on the locale (the first runs are great, but from 30 runs onwards it takes a long time). I tried an on statement to update the value of the remote variables, but it's not working. So I am really struggling with how I can "send" a variable to remote locales (I have tried setter functions with an on statement, and that's not working either). I have tried a bunch of ideas, but all of them degraded the performance. So I would appreciate any ideas about how to replicate/send/copy one variable which resides on Locale 0 to the other locales (again, references end up in deadlock).
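
To make it concrete, this is roughly the kind of copy I'm after (a simplified sketch with a made-up element type; the real ParameterSets is more complex):

use List;

var ParameterSets: list(int);   // lives on and is updated by Locale 0

coforall loc in Locales do on loc {
   // assigning into a variable declared inside the 'on' block should give
   // each locale its own copy, rather than a reference back to Locale 0
   var myParams = ParameterSets;
   // ... work with myParams locally ...
}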
I also realized another thing about sync variables.
So, I used sync variables to coordinate some work among locales. But unfortunately, livelocks started to appear. I'm 99% sure that I had used these variables correctly. In fact, on each locale I had a begin statement that blocked until the lock on Locales[0] was released by another locale. So, in each begin statement I had:

begin {
     on Locales[0] {
          var Lock = MasterTablesAvailable$.readFE();
          // do work
          MasterTablesAvailable$.writeEF(1);
     }
}

I also had a try/catch to refill the lock in case the body fails and the lock is left empty. But after 20/30 runs, livelocks appeared either before or after reading/writing the lock.
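
For reference, the refill pattern I mean is roughly this (simplified; the work body is only a placeholder):

var MasterTablesAvailable$: sync int = 1;

on Locales[0] {
   var Lock = MasterTablesAvailable$.readFE();   // take the lock (leaves it empty)
   try {
      // ... do the work that might fail ...
   } catch e {
      writeln("work on Locale 0 failed: ", e.message());
   }
   // refill the lock whether or not the body failed
   MasterTablesAvailable$.writeEF(1);
}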

Thanks again for all help,
Marjan

I am using the following code, but at some point the livelocks start.

var SharedLoacaleData: list(LocaleData);
var MasterTablesAvailable2$: sync int = 1;
proc SetLocalesParallelData(DBClass: ProgramDatabase){
   var Lock = MasterTablesAvailable2$.readFE();
   // modify SharedLoacaleData
   MasterTablesAvailable2$.writeEF(1);
}


// Excluding Locale [0]
coforall loc in 0..MyRemoteLocales.size-1 with (ref SharedLoacaleData, ref MasterTablesAvailable2$) {
   on MyRemoteLocales[loc] {
      begin with (ref MasterTablesAvailable2$, ref SharedLoacaleData, ref LoacaleData) {
         var Lock = MasterTablesAvailable2$.readFE();
         if (SharedLoacaleData.size >= here.id) {
            LoacaleData = SharedLoacaleData[here.id-1];
         }
         MasterTablesAvailable2$.writeEF(1);
      }
      // rest of work
   }
}
UPDATE: I am observing that this livelock happens because one node has taken the lock and then becomes super slow or "off" and is no longer engaged in the work (I don't see printed results from some nodes after the livelock appears). So, is there any way to handle this condition, so that when a node is off we don't assign it the lock? For example, getting some signal from the locales that they are still alive?
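
Something like the following is the kind of "alive" signal I have in mind (purely a hypothetical sketch; "heartbeat" and the loop condition are made-up names):

// each locale periodically bumps its own counter in an array on Locale 0,
// so Locale 0 could notice a counter that has stopped advancing
var heartbeat: [LocaleSpace] atomic int;

coforall loc in Locales do on loc {
   var working = true;            // stand-in for the real loop condition
   while working {
      heartbeat[here.id].add(1);  // remote atomic update to Locale 0's array
      // ... do one chunk of work ...
      working = false;            // placeholder so the sketch terminates
   }
}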