CHPL_RT_NUM_THREADS_PER_LOCALE variable

Hi Everyone,

I have a question and I really do appreciate your opinion on this.
I'm running a heavy hydrological model in my Chapel code. The model has to be run by opening a sub-process using spawnshell since it's a file. I have noticed that running it this way (Chapel --> spawnshell) increases the running time significantly compared to when I run it on Windows systems or directly on Linux systems (outside Chapel). I suspect this is because it runs in a sub-process. I wasn't able to set the number of cores allocated to the spawnshell process.
I haven't set any Chapel environment variables since I'm using Compute Canada and the Chapel modules there. However, I've recently come across the CHPL_RT_NUM_THREADS_PER_LOCALE variable. Do you think setting the following variables will help me?

export CHPL_TASKS=fifo
export CHPL_RT_NUM_THREADS_PER_LOCALE="MAX_LOGICAL"
export CHPL_RT_NUM_THREADS_PER_LOCALE_QUIET=yes

Thank you,
Marjan

Hi -

Yes, the default tasking layer (qthreads) tends to use a lot of CPU resources (even when idle), and the process you have spawned will compete with it.

Using CHPL_TASKS=fifo will avoid the problem. Another thing to try would be export CHPL_RT_OVERSUBSCRIBED=yes, as described in Executing Chapel Programs (Chapel Documentation 1.26).

(Edit -- I don't mean to say that this is necessarily the problem - just that it is worth trying to run it with these to see if it addresses the problem.)

Best,

-michael

Thank you so much! I'll try them both

Hi Marjan -

Tagging onto Michael's response belatedly, I wanted to add a few things:

I think that your thought to use CHPL_RT_NUM_THREADS_PER_LOCALE="MAX_LOGICAL" will probably hurt performance rather than help it. Specifically, if you're on a 4-core system with hyperthreading, Chapel's default when using CHPL_TASKS=qthreads will typically be to use four threads, one per core, whereas this setting will cause it to use 8 (2 per core to reflect the hyperthreading). So if the problem is that Chapel's threads are degrading the performance of your external routine, I don't think adding more threads will help. If anything, I might try dropping the number of threads per locale by 1 or 2 from the default to see what happens. My guess is that the performance of any Chapel-related code would degrade slightly, but that the externally spawned code should run faster if the OS schedules it on the idle cores.
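A minimal shell sketch of that experiment, assuming a Linux node where `nproc` is available: it takes the number of processing units visible to the job, holds a couple back for the external model, and gives the rest to Chapel. Note that `nproc` counts logical CPUs, so on a hyperthreaded node it can be larger than Chapel's one-thread-per-core default. The `reserve` value is an illustrative guess, not a recommendation.

```shell
# Processing units visible to this process (logical CPUs, limited by any affinity mask).
avail=$(nproc)
# Hypothetical number of CPUs to leave for the externally spawned model.
reserve=2

threads=$((avail - reserve))
# Never ask for fewer than one thread.
if [ "$threads" -lt 1 ]; then
  threads=1
fi

export CHPL_RT_NUM_THREADS_PER_LOCALE="$threads"
echo "visible CPUs: $avail, Chapel threads per locale: $threads"
```

These lines would go in the job script before the command that launches the Chapel program.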

I also wanted to tag @ronawho on this as he's our resident expert on threading and tasking in Chapel, so may have some other ideas.

I also wanted to ask some more questions about your use case to understand it, such as:

  • is the hydrological model itself single-threaded or multi-threaded (or even distributed memory)?
  • are you running a single instance of the model at a time using a single Spawnshell, or multiple?
  • are Chapel and the hydrological model ever executing simultaneously, or is it more like they take turns running?

Lastly, I wanted to mention that this issue of Chapel threads and threads of external programming models interfering with one another is something we've seen in a few other cases and will ultimately need to address in a more satisfying manner (which is why I ask some of these questions). A specific example that comes to mind is here:

Thanks,
-Brad

Thank you so much, Brad, for the help.
You're saying "but that the externally spawned code should run faster if the OS schedules it on the idle cores". I was thinking about that too: limiting the cores that the Chapel program can use. But in Chapel, we can't explicitly set the number of cores used by the program; it uses as many cores as are available, right? So if by idle cores you mean the cores that are not engaged in running the Chapel program, there are not many in my case. I am using shared computers, so each time I receive, for example, 5 CPUs per computer. As soon as I run Chapel, all of them are engaged in running the program.
The main challenge that I have now with Chapel is this spawnshell, since the model I am running is actually the time-consuming part of my project, the part for which I've switched to parallel computing.

  • The hydrological model is multi-threaded itself.
  • On each locale, I prepare input data, open a spawnshell (one single instance on each locale), execute the model (and wait until it's done), and then return to the Chapel program and read the output databases. So, basically, when the spawnshell is open, nothing is running in Chapel except the spawnshell code (it's not parallel with any tasks). So, I guess I would say that yes they take turns.
  • I was wondering if there is any control over the shell that the spawnshell command opens in Chapel? I was communicating with research support (Chapel) on the Compute Canada team. They told me: "I don't see any documentation on how spawnshell gets resources, e.g. how many threads it can launch. I suspect by design the goal of spawnshell is to quickly run a Unix tool and capture its output from within Chapel, so I won't be surprised if it does not do any multi-threading at all. I just don't know. Or maybe it gets treated as a normal Unix process that can launch multiple threads, up to the available number of cores on the system -- on the cluster this is controlled through Slurm's cgroup settings (so that you do not exceed the number of cores allocated to your job)." So, any info would be appreciated.

Once again, thank you so much for the support!
Marjan

Hi Marjan —

No, that's not quite right—it will only use as many cores as are available by default or if you ask it to. For example, if you were using CHPL_TASKS=qthreads on a 4-core system, setting CHPL_RT_NUM_THREADS_PER_LOCALE=2 would cause Chapel to only use 2 cores, leaving 2 free. We don't typically suggest users change this default, but in your case, it would be interesting to see whether or not it reduces interference with the external processes. As I mentioned before, the main downside would be that when Chapel code was running, it would only get to use half the cores. But if the heavy-lifting is in the external code in your case, that might be appropriate. Except that you then say:

"The hydrological model is multi-threaded itself."

Oh, that's interesting, and may reduce the value of my experiment above. Specifically, if the hydro model was single-threaded, you could imagine it using one of the unused cores if Chapel left some open. But if hydro wants to use all four cores, Chapel would still be interfering with two of them, which would likely still hurt performance.
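For the experiment itself, the thread count can be set for a single run by prefixing the launch command, which leaves the rest of the shell session untouched. In this sketch, `sh -c` is only a stand-in for the compiled Chapel program, used so the snippet runs anywhere; it reports the value the launched process sees.

```shell
# The assignment prefix applies only to the one command that follows it.
# `sh -c ...` is a placeholder for the real Chapel executable.
seen=$(CHPL_RT_NUM_THREADS_PER_LOCALE=2 sh -c 'echo "$CHPL_RT_NUM_THREADS_PER_LOCALE"')
echo "launched process saw CHPL_RT_NUM_THREADS_PER_LOCALE=$seen"
```

Using `export` instead would apply the setting to every later command in the script, which is what a batch job file usually wants.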

"I would say that yes they take turns."

This is encouraging, though we don't have a great solution for it today. Specifically, we've talked about having a mechanism for Chapel's threads to be put to sleep before calling into an external routine (or, in this case, the spawnshell()), and then waking them back up after that section was done. But this feature isn't provided today. However, you may be able to approximate it as follows:

After I sent you the previous response with the link to the related issue I was remembering, I re-read it and noted that it had some suggestions that might help as well. Note that these are specific to CHPL_TASKS=qthreads, so if you've switched to CHPL_TASKS=fifo as suggested in your original message, you may want to switch back. Specifically, it suggests that if you set these in your environment:

export QT_AFFINITY=no
export QT_SPINCOUNT=300

then Chapel's interference with your hydro model may be reduced. If that does help, but hurts your Chapel code's performance significantly, this comment had some prototype code that you could call just before and after your spawnShell() call to reduce the interference just for that part of the program. Note, however, that I haven't tried this code to see if it still works, so this is slightly uncharted territory (though territory that Elliot may be able to help with when he's back next week).

I think your / Compute Canada's interpretation of spawnshell is right. Specifically, I think of spawnshell as being like C's system() command or typing a command at your shell's command prompt, where that process can do whatever it wants and Chapel has no real control over it or effect on how it runs. Except that, since the Chapel program is still running, it can have performance interference effects as you're seeing.
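One way to see the "normal Unix process" behavior from a plain shell, as a sketch assuming a Linux node with `nproc`: a child process inherits the parent's exported environment and sees the same set of processing units. If spawnshell behaves like system() as described above, the same would be expected there, but that part is an assumption rather than something tested here. `DEMO_SETTING` is a made-up variable used only for the demonstration.

```shell
export DEMO_SETTING=inherited              # hypothetical variable, for illustration only
parent_cpus=$(nproc)                       # CPUs visible to this shell
child_cpus=$(sh -c 'nproc')                # CPUs visible to a child process
child_env=$(sh -c 'echo "$DEMO_SETTING"')  # exported variables reach the child
echo "parent sees $parent_cpus CPUs, child sees $child_cpus, child env: $child_env"
```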

Asking an obvious but potentially annoying question: Is the hydro code so big that it would be a lot of effort to translate it into Chapel and avoid the need for mixing two computational models? :smiley:

-Brad

Thank you very much Brad for the detailed info. I do enormously appreciate all the help.

The information that I can limit the cores allocated to Chapel is promising. Also, I think it wouldn’t be worse than the current situation. Right now, the model is only using the same core(s) that Chapel is using (bad for both the Chapel program and my model). If I leave 2/3 cores free on each locale, then the model can use them “along” with the Chapel cores (same as before for Chapel but better for the model). I’ll also apply the affinity idea with the settings you have suggested to see any potential changes in performance.

The idea of putting Chapel threads to sleep is awesome. I’d been looking at the Chapel documentation to see if this could be done with the current version, because my Chapel program also works with databases (outside the Chapel program, using C#); putting Chapel threads to sleep whenever I switch to C# and the databases would help performance there too.

Thank you for the info about the Spawnshell as well.

Unfortunately the model cannot be “translated” into Chapel. It’s super big and complex and written in C. Porting it from the Windows configuration to the Linux configuration took a huge amount of time. So, I wish it were possible, but it’s not.

Thank you so much again,
Marjan

Let us know what you find, and we'll have Elliot give us his thoughts about options for putting Chapel threads to sleep when he's back next week (not that I expect an easy solution, but... it's good to have users asking for things that would be useful to them).

Have a good weekend,
-Brad

Of course, I’ll update you here.
Looking forward to knowing his ideas next week.
Thank you again and have a nice weekend as well.

Marjan

Hi @bradcray,

I can’t thank you enough for the solutions you gave me yesterday. It’s crazy that by applying those suggestions the model is running so fast on each node, and the performance of my Chapel program hasn’t degraded (at least not in the couple of runs I have tested). Nothing could have made me happier, since it really impacts my work in a positive way, and THANK YOU very much for supporting Chapel programmers.

I had computers with 8-core CPUs. So, as the first solution, I only used CHPL_RT_NUM_THREADS_PER_LOCALE=3 to leave 5 cores free for the hydro model. This solution “alone” didn’t help. So, I added all of the following lines to my bash file before the command that executes the Chapel model (I didn’t change any other default settings).

  • export CHPL_RT_NUM_THREADS_PER_LOCALE=3

  • export QT_AFFINITY=no

  • export QT_SPINCOUNT=300

And my model is running really fast on each locale, now! I'd appreciate if you also give me brief info about what those QT variables actually do.

Thank you a million,

Marjan

Hi Marjan —

I'm glad to hear that we were able to get your code working more to your satisfaction. It might be useful to have you comment on the issue linked above with a note about how much of a difference this made for you (if that would be easy to quantify), just to capture more experiences and evidence of its importance on that issue.
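If it helps with quantifying the difference, a simple approach is to record wall-clock time around the run under each configuration. A minimal sketch, where `sleep 1` is a placeholder for launching the actual Chapel program:

```shell
start=$(date +%s)    # seconds since the epoch, before the run
sleep 1              # placeholder: replace with the real launch command
end=$(date +%s)      # seconds since the epoch, after the run
elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

Repeating this a few times per configuration gives a rough before/after comparison; whole-second resolution should be enough for a long-running model.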

To your question:

"I'd appreciate if you also give me brief info about what those QT variables actually do."

First, in looking this up myself to make sure I had the details right, I found that, rather than setting these two QT_ variables, the current preferred approach is (I think) to set the following single variable:

export CHPL_RT_OVERSUBSCRIBED=yes

This should have the effect of applying the two QT_ settings above under the covers (and has the benefit of sticking to CHPL environment variables). If you could verify that this gives you similarly good behavior, that'd be nice to know (and if it didn't, that'd be important to know as well).

Documentation for this setting is here, though it focuses more on the case of running multiple Chapel processes per node rather than calling out to external processes. The net effect is similar, though.
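To compare the two approaches under otherwise identical conditions, something like the following could go at the top of the job script. This is only a sketch: `MODE` is a made-up switch, and the thread count of 3 is just the value reported earlier in this thread.

```shell
MODE=${MODE:-oversubscribed}   # hypothetical switch: "qt" or "oversubscribed"

export CHPL_RT_NUM_THREADS_PER_LOCALE=3
case "$MODE" in
  qt)
    # The two Qthreads-level settings suggested earlier.
    unset CHPL_RT_OVERSUBSCRIBED
    export QT_AFFINITY=no
    export QT_SPINCOUNT=300
    ;;
  oversubscribed)
    # The single Chapel-level setting meant to replace them.
    unset QT_AFFINITY QT_SPINCOUNT
    export CHPL_RT_OVERSUBSCRIBED=yes
    ;;
  *)
    echo "unknown MODE: $MODE" >&2
    ;;
esac
echo "MODE=$MODE"
```

Each branch clears the other approach's variables so that a leftover setting from the session does not blur the comparison.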

Answering your question, though:

  • QT_AFFINITY: I'm fairly certain that setting this to "no" asks Qthreads not to pin the threads that it uses to the cores, which makes them better at sharing the processor cores with other threads/processes.
  • QT_SPINCOUNT: When Qthreads are idle and have no tasks to run, they spin looking for work for a while, and then if they don't find any, they go to sleep on a condition variable that will wake them up when more work arrives. This variable says how many times they should spin before going to sleep, where the default is something like 300k. So setting it to 300 effectively says "go to sleep much faster if nothing seems to be going on."

-Brad

Hi Brad,

Thank you so much for providing the documentation link and the explanation of QT variables.

In fact, I once tried without export QT_SPINCOUNT=300, using only CHPL_RT_NUM_THREADS_PER_LOCALE and QT_AFFINITY=no, and it worked as well.

In addition, I tried the following two variables, and yes, CHPL_RT_OVERSUBSCRIBED has the same effect in my case:
export CHPL_RT_NUM_THREADS_PER_LOCALE=3

export CHPL_RT_OVERSUBSCRIBED=yes

I just left a note there and thank you for this suggestion.

Best,
Marjan
