How do I define locales?

Assuming I have 3 dual-CPU systems: tom, dick, and harry.

I want to partition each of these dual-CPU nodes (each CPU has 26 cores) into 8 logical tasks of 6 cores each (which means 4 cores per node are unused), i.e. 4 six-core tasks per CPU.

That means I will have 24 locales (I hope).

How do I tell Chapel this is what I want?

Now let's add one (or two) GPUs. What else is required?

Happy to be pointed at the right place in the documentation.

Thanks - Damian

Unfortunately, what you want to do isn't easily done.
Multilocale Chapel Execution — Chapel Documentation 2.2 has more information on creating co-locales. "-nl 3x8" is close to what you want, but it will put all four unused cores on the second socket instead of two on each socket. I didn't put much effort into efficient partitioning when there are unused cores, figuring this wouldn't be a common occurrence, and in general it's a difficult problem. However, if you launch your locales manually and bind them to cores yourself, the Chapel runtime will use the cores it's given. With slurm, for example, you could use the --cpu-bind argument to bind each locale to the cores you want it to use.
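For instance, assuming a hypothetical binary name ./myprog compiled for multilocale execution, the closest single-flag invocation would be:

    # 3 nodes, 8 co-locales per node, 6 cores per co-locale;
    # the 4 unused cores per node all end up on the second socket
    ./myprog -nl 3x8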

John

If I use "-nl 2x13", will it guarantee that each 13-core locale is on a single CPU?

My experience, at least with MPI, is that partitioning the 52-core machine into 8 threads of 6 cores each yields better performance than 4 threads of 13 cores each. Your algorithm would leave one of my threads with 2 cores running on one CPU and 4 cores running on the other. Undesirable.

How do I launch locales manually? Can you point me at that documentation? Thanks.

The syntax is "-nl NxL", where N is the number of nodes/systems, and L is the number of co-locales per node. As I understood it, you have three nodes, each of which has two CPUs/sockets, and each socket has 26 cores. So if you want one core per locale (which I do not suggest doing) you would use "-nl 3x52". That will give you a total of 156 locales across the three nodes.
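As a sketch (again with a hypothetical ./myprog), the variants discussed here would look like:

    ./myprog -nl 3x2     # 2 co-locales per node, one per socket, 26 cores each
    ./myprog -nl 3x4     # 4 co-locales per node, 13 cores each
    ./myprog -nl 3x52    # 52 co-locales per node, 1 core each (156 locales total; not recommended)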

John

If you use "-nl 3x4", each of the three nodes will have four co-locales and each co-locale will have 13 cores. Each locale's cores will likely be confined to a single socket, although that is somewhat dependent on how the OS numbers its cores. For example, it would be a problem if all the odd-numbered cores were on one socket and all the even-numbered ones on the other, but I haven't seen that happen.

As for launching manually (assuming you are using slurm), run your program with -v and you will see the command used to invoke srun. Copy that line and modify it to add a --cpu-bind argument. If you use --cpu-bind=mask_cpu: you can provide a comma-separated list of CPU bitmaps, one per locale, that specifies which cores the corresponding locale should use. See the srun man page for more details.
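For example, for eight 6-core co-locales per node, the added option might look like the line below. This is only a sketch: it assumes hyperthreading is off, that the OS numbers cores 0-25 on the first socket and 26-51 on the second (check with lscpu), and that the mask list is consumed one mask per co-locale on a node; confirm the details against the srun man page for your system.

    # four 6-core co-locales per socket, leaving cores 24-25 and 50-51 idle
    --cpu-bind=mask_cpu:0x3f,0xfc0,0x3f000,0xfc0000,0xfc000000,0x3f00000000,0xfc000000000,0x3f00000000000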

John


Got it.

Looks like I could use "-nl 3x4", which means 13 cores per locale. Does that guarantee that each 13-core locale sits on one CPU?

Then the question is how I get "-nl 3x8" to ensure that each 6-core locale is on a single CPU. Where do I look to learn how to start locales manually?

Hi Damian —

What processor types do these nodes have? And what network is connecting
them?

Without that information, I'd be inclined to try running using -nl 3x2
(3 nodes with 2 co-locales per node, one per CPU) and see how that works
for you since it will use all resources, match the logical 2-CPU structure
of the nodes, and avoid overdecomposing the nodes (given that Chapel uses
a thread to progress communication). Chapel and MPI are different enough
that it's not obvious to me that what has worked well for MPI will
necessarily work best for Chapel, so I'd start with 3x2 as "the most
natural thing" for Chapel.

Of course, you may know more about the algorithm and reasons for why that
numbering "makes sense" than I'm picking up on here.

-Brad

I have either Xeon E5-2650 v4 or Xeon Gold 6230R processors, in both cases connected across 10 Gbps Ethernet.

The reason I want to know these details is to ascertain how partitioning the CPU affects performance. It certainly helps big time with MPI.

The (linear algebra) algorithms I am playing with scale nicely up to 4 cores, drop slightly towards 8, drop off even more towards 12, and often bottom out at about 24 cores per locale. So I would guess that "-nl 3x2" is going to be bad and "-nl 3x4" is the bare minimum. Whether "3x8" is a significant improvement (like it is with MPI) is the big question, although "3x12" probably has too much inter-processor communication overhead.

I do not know the answers to those questions with Chapel, but I need to know enough to be able to run the experiments that would answer them. It would be good to know that Chapel is smart enough to lay out the cores of a single locale on the same CPU (as should ideally happen with "3x8", but which seems not to be the case currently).

Not urgent. I hate MPI with a passion, in case I gave the wrong impression.
