Multi-Locale Chapel

Hi everyone,

I was trying to run a Chapel program on two locales, both Linux systems, and got the error below. I am using GASNet with the udp conduit and ssh. GASNet installed without errors, and I can ssh from one machine to the other. I simply have no idea what this error means. Regarding UDP, I am not sure whether I need to set up TCP-to-UDP forwarding on my machines (I have not done so). In addition, my LAN has no internet connection. Thanks in advance for any help.

Hi Marjan —

These errors aren't familiar to me, but suggest to me that something is not configured correctly on your nodes. The continual warnings/errors about "rsaauthentication" seem concerning, but it looks like you may be in the habit of just ignoring those? I don't know whether those are related to the "worker failed DNSLookup on master host name" error, which looks like the heart of the matter to me (suggesting that the process running on your remote node isn't able to look up the master node's IP address via its hostname?). I'm afraid my own UNIX networking skills aren't good enough to know what to suggest, though.

My best suggestions for digging into it a bit more are those listed here, in the Troubleshooting job launch section.

I've put out some feelers to see whether someone else can suggest a better diagnosis or path forward based on your output, but that may take a few days (unless someone else beats them to it).

Sorry not to know the answer offhand,
-Brad

Marjan —

I may have replied a moment too soon. After posting this, a colleague on the GASNet team reported:

The user probably needs to set this envvar: (from conduit docs)

  • GASNET_MASTERIP
    Specify the exact IP address which the worker nodes should use to connect
    to the master (spawning) node. By default the master node will pass the
    result of gethostname() to the worker nodes, which will then resolve that
    to an IP address using gethostbyname().

If that fails, please ask them to provide output from a launch attempt using env GASNET_VERBOSEENV=1

-Brad
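
To check whether the default lookup described in the quoted conduit docs would succeed on a given node, something like the following minimal Python sketch might help. It only mirrors the gethostname() / gethostbyname() sequence from the docs; it is not part of GASNet or Chapel, and the function name resolve_master is made up for illustration.

```python
import socket


def resolve_master(hostname=None):
    """Try to resolve a host name to an IPv4 address.

    Returns (name, ip) on success and (name, None) if the lookup fails.
    With no argument, the local host name from gethostname() is used.
    """
    name = hostname or socket.gethostname()
    try:
        return name, socket.gethostbyname(name)
    except OSError:  # socket.gaierror and socket.herror are subclasses
        return name, None


if __name__ == "__main__":
    name, ip = resolve_master()
    if ip is None:
        print(f"'{name}' does not resolve; GASNET_MASTERIP is probably needed")
    else:
        print(f"'{name}' resolves to {ip}")
```

Running it on the master node prints the name that would be passed to the workers; calling resolve_master with that name on a worker node shows whether the worker can resolve it. If it cannot, setting GASNET_MASTERIP explicitly sidesteps the lookup.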

I really do appreciate the help, Brad. In fact, the problem got solved by setting the GASNET_MASTERIP environment variable. However, I have followed the steps in the GASNet doc you suggested, and now I have no idea about the error. I would enormously appreciate any help in this regard.

The solution was setting the GASNET_WORKERIP to 192.168.0.0
Thank you, Brad! I am now able to run my chapel code on multiple computers without any error!!!


Great, glad to hear it Marjan! Just to make sure there aren't surprises down the road: The GASNet over UDP option is a great way to develop Chapel programs in a portable manner, but it is not a high-performance configuration by any means—primarily because Chapel and GASNet rely on good network support for RDMA to perform well, and UDP / ethernet don't provide that. Ultimately, to get good performance and scalability with Chapel, you will likely want to be running on a more capable network that supports RDMA like InfiniBand.

Best wishes,
-Brad

Hi Marjan,

I should note that the GASNET variable is GASNet-specific. We have more
general variables that may help with related Chapel-specific features:
CHPL_RT_MASTERIP and CHPL_RT_WORKERIP. More information about those
environment variables can be found at the following link:
https://chapel-lang.org/docs/latest/usingchapel/launcher.html#chpl-rt-masterip

Thanks,
Lydia
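
For anyone scripting the launch, here is a rough Python sketch of how the variables mentioned in this thread could be collected into an environment for the launcher. The helper name launch_env, the master address 192.168.0.1, and the ./hello binary are all hypothetical; only the variable names and the 192.168.0.0 worker value come from the posts above.

```python
import os


def launch_env(master_ip, worker_net, prefix="GASNET", verbose=False, base=None):
    """Build an environment dict with the master/worker IP variables set.

    prefix is "GASNET" for the GASNet-specific variables or "CHPL_RT"
    for the Chapel-level ones. base defaults to a copy of os.environ.
    """
    env = dict(os.environ if base is None else base)
    env[f"{prefix}_MASTERIP"] = master_ip
    env[f"{prefix}_WORKERIP"] = worker_net
    if verbose:
        # Asks GASNet to report the environment settings it picks up.
        env["GASNET_VERBOSEENV"] = "1"
    return env


if __name__ == "__main__":
    # 192.168.0.1 is an assumed master address; substitute your own.
    env = launch_env("192.168.0.1", "192.168.0.0", verbose=True)
    for key in ("GASNET_MASTERIP", "GASNET_WORKERIP", "GASNET_VERBOSEENV"):
        print(f"{key}={env[key]}")
    # A launch could then look like this (hypothetical binary name):
    #   subprocess.run(["./hello", "-nl", "2"], env=env, check=True)
```

Passing prefix="CHPL_RT" sets CHPL_RT_MASTERIP and CHPL_RT_WORKERIP instead; see the linked launcher docs for how those are interpreted.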

Thank you, Brad, for letting me know which kind of network gives the best performance for Chapel. In fact, this LAN cluster with gasnet-udp is only a test cluster where I will be coding the program and testing parallelization ideas. At the end of the day, the goal is to run the final Chapel program on one of the Compute Canada clusters. I will make sure to go with a network that supports RDMA.

Thank you so much, Lydia, for pointing me to a page where I can read more about the Chapel-specific variables. I appreciate it!
