Use the Local Machine in the Multi-Locale Run

Hi Chapel community,

I am trying to run Chapel in multi-locale mode. I have three Linux systems, and I am using ssh/udp/gasnet. I am compiling and working on one of these machines, and I have tried to add its IP to GASNET_SSH_SERVERS and SSH_SERVERS, since I want the "local" machine to also be one of those that run the program in a distributed manner. But when I try to include this machine by running with -nl 3, GASNet gets stuck and I have to terminate it. So I am observing that the local machine is not participating, since GASNet doesn't seem to use the MASTERIP as a WORKERIP as well. Just to confirm: in a multi-locale run, does the "local computer" on the network, where we actually compile the program, not participate?

Best,
Marjan

Hi Marjan —

I don't think there should be any inherent problem with using the local machine to run a worker, though I admit it's not very often that I do it (so am not confident). Today's a holiday, so most of us at HPE are not working (myself included), but in the meantime, could you send the output of running your program with the --verbose flag? Also, if you only list the local machine in your GASNET_SSH_SERVERS variable and run -nl 1 do you have any success?

Thanks,
-Brad

Hi Marjan —

Back at work, I gave this a try today, and it worked for me—but obviously a person's experience could vary widely depending on their system configuration and environment. Here's a quick review of what I did:

  • export GASNET_SPAWNFN=S
  • export GASNET_SSH_SERVERS="host1 host2 host3" # where host1 was the one I'm on

And then these are settings that I always use with GASNet that may have no specific effect on this experiment:

  • export GASNET_QUIET=Y
  • export GASNET_ROUTE_OUTPUT=0

My key settings from printchplenv are:

CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: udp
  CHPL_GASNET_SEGMENT: everything
...
CHPL_LAUNCHER: amudprun

And I'm able to compile and run with no problems:

$ chpl test/release/examples/hello6-taskpar-dist.chpl 
$ ./hello6-taskpar-dist -nl 3
Hello, world! (from locale 0 of 3 named host3-0)
Hello, world! (from locale 2 of 3 named host1-2)
Hello, world! (from locale 1 of 3 named host2-1)

Note that I am able to ssh manually without a password between any of these hosts.
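
One way to check the passwordless-ssh requirement ahead of time is a small loop over the hosts in GASNET_SSH_SERVERS. This is only a sketch, not something from GASNet or Chapel: the function name check_ssh_servers and the SSH_CMD override are hypothetical, and the host names are placeholders.

```shell
# Sketch: confirm each host in GASNET_SSH_SERVERS accepts non-interactive ssh.
# SSH_CMD is a hypothetical override (not a GASNet setting) that lets the
# loop be dry-run without a network; by default plain ssh is used.
check_ssh_servers() {
  failed=0
  # Intentionally unquoted: the variable holds a space-separated host list.
  for host in $GASNET_SSH_SERVERS; do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
      echo "ok: $host"
    else
      echo "FAILED: $host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (host names are placeholders):
#   export GASNET_SSH_SERVERS="host1 host2 host3"
#   check_ssh_servers
```

Any host reported as FAILED would presumably also stall or fail when the launcher tries to reach it, so it is worth sorting out before running with -nl 3.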

If I run with --verbose I get a very long line, some key components of which are:

/path/to/amudprun  -np 3 /data/cf/chapel/bradc/chapel/hello6-taskpar-dist_real -nl 3 --verbose [lots of environment stuff here]

So it is possible, and I think the next step is to see what your verbose output looks like. It may also be worth seeing whether we can turn on some verbose output for GASNet's amudprun itself.

-Brad

Thank you Brad, I appreciate it a lot. I’ll give it a try with all these settings tomorrow and see if it works. If not, I’ll share the verbose output here, and I would appreciate further help.

Hi Brad, I tried to compile and run the program with all the settings you suggested. I additionally added export GASNET_MASTERIP=local and GASNET_WORKERIP=192.168.0.0, based on previous advice I got here. But it didn't run, and here you can see the verbose output. I would appreciate any suggestion to resolve this issue.

If you are using a recent Ubuntu, you might be running into "intermittent gasnet error in local UDP configuration" (Issue #18186 in chapel-lang/chapel on GitHub). I'm able to work around the problem using export CHPL_COMM_SUBSTRATE=smp (you have to run make again after that). But that won't help you run across multiple machines.

Thank you so much. I did what you suggested, but it didn't solve the problem. I only did it on my master node, though. The same error showed up again.

Can you tell us which OS version you are using?

Yes, of course. Ubuntu 21.04

Hi Marjan —

Would you try:

export CHPL_COMM_DEBUG=1
cd $CHPL_HOME && make

and then re-compile and re-run your program? This will re-build GASNet with debugging enabled, and I'm hoping will give us more information.

Thanks,
-Brad

Thank you so much, Brad. I think last time it had not been compiled properly. It seems it is working. However, I'm having a hard time figuring out how I can parallelize locales, as you see on this screenshot.
I have used:
for loc in Locales {
  writeln(here.id:string);
}
But it doesn't run the way I expect. The general question is: how can I parallelize across Locales? If I use forall, it throws shadow variable errors. Thank you so much!

Hi Marjan —

Chapel doesn't migrate tasks from one locale to another unless you use an 'on-clause', so in order to have your computation go across multiple locales, you'd need to do:

for loc in Locales {
  on loc {
    writeln(here.id:string);
  }
}

I'm also curious whether you still have CHPL_COMM_SUBSTRATE set to smp which, as Michael said, will prevent you from running across multiple machines. If you do, I'd suggest unsetting it (export -n CHPL_COMM_SUBSTRATE) and then re-making Chapel, recompiling your program, and re-running.
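
To double-check that the substrate override is really gone before rebuilding, something like the following sketch could be used. The helper name substrate_status is hypothetical. It uses unset, which removes the variable entirely, whereas export -n in bash only stops exporting it; either way the variable should no longer reach child processes such as make or chpl.

```shell
# Hypothetical helper: report whether CHPL_COMM_SUBSTRATE would be inherited
# by child processes such as make or chpl.
substrate_status() {
  # Ask a child shell, since only exported variables reach child processes.
  sh -c 'echo "CHPL_COMM_SUBSTRATE=${CHPL_COMM_SUBSTRATE:-<not set>}"'
}

export CHPL_COMM_SUBSTRATE=smp   # the workaround setting from earlier
substrate_status                 # prints CHPL_COMM_SUBSTRATE=smp
unset CHPL_COMM_SUBSTRATE        # remove the override
substrate_status                 # prints CHPL_COMM_SUBSTRATE=<not set>
```

After the rebuild, printchplenv should show which substrate is actually in effect.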

-Brad

The problem with this for loop is that it runs the locales in a kind of sequential manner. I have a while loop in my program, and I see that until the while loop is done on one locale, it doesn't start the job on the other. Is that true, or am I making a mistake somewhere?

That is true; what Michael suggested is the solution! Thank you, Michael! Without this setting, I see that error again!

So to be clear: With Michael's workaround, if you're running on a 100-node cluster, you will not be able to use 99 of the nodes because the smp option only supports shared memory. So I think you should disable that setting, add the CHPL_COMM_DEBUG=1 setting, rebuild, recompile, and hopefully you should get more debugging output from GASNet than in the original run that generated an error.

In Chapel, the user's program starts running as a single task on locale 0, so the only way to start using other locales is to use on-clauses directly (as in the code I sent) or to use an abstraction (like a distributed array) that uses on-clauses as part of its implementation. A classic way to spin up a task on every locale is:

coforall loc in Locales {
  on loc {
    writeln("running a task on locale ", here.id, " named ", here.name);
  }
}

-Brad

Thanks for the alert. I will work on the issue using the way you suggested.
A question about the coforall: if the job fails on one of the machines, I observe that the program stops on all machines. Is there any way to force the program to continue the work even though one of the machines fails?

Is there any way to force the program to continue the work even though one of the machines fails?

Not at present, no. Chapel's global address space and the underlying technologies on which we rely (like GASNet) don't deal well with machine failure.

-Brad