Chapel's programming model and features

I want to cite Chapel in a paper and try to describe and understand Chapel's programming model and features clearly. Could you please confirm my understanding and correct me if there is anything wrong?

  1. Chapel supports multi-node. Each node is called a locale, and the number of locales can be accessed by ChapelLocale.numLocales

  2. Chapel supports multi-core CPU. The number of CPU cores can be accessed by ChapelLocale.locale.numCores

  3. Chapel supports GPU and multiple GPUs per node.

  4. Users can easily change whether the task will execute on CPU or GPU.

  5. Chapel lets users express how data is distributed using the Domain map Standard Interface.
    The same task will perform computation on each chunk of data. Chapel does not provide a way to let users express bulk task launches and then decide which subtask should go to which locale. In Chapel's programming model, the task's logic is the same across locales, and the only difference is the data.

  6. Chapel lets users decide how to map data to memories by changing "CHPL_GPU_MEM_STRATEGY". But changing it needs recompilation of Chapel.

GPU Programming – Chapel Documentation 1.32

On GPUs, there is framebuffer memory and zero-copy memory, but Chapel does not seem to support zero-copy memory or provide a mechanism for users to specify whether to put data into zero-copy memory or not.

  1. Users in Chapel do not have a way to express memory layouts for arrays (row-major vs. column-major ordering of axes mapped onto physical memory) or layouts for structs (Array of Structs vs. Struct of Arrays)

  2. Chapel provides a data structure DistributedBag for load balancing.
    In Chapel, there is no general interface for expressing load balancing, so users need to write load-balancing policies entangled with the application code.

  3. Chapel provides a way for users to express garbage collection for the data in memory:
    Base – Chapel Documentation 1.32

  4. Chapel, at least in its current form, does not provide a way to let users decide the scheduling policy (e.g., when to run which tasks).

Many thanks!

Hi Anjiang,

I will try to answer the questions as best I can (and with more weight on the GPU side); others should feel free to elaborate.

Yes. Note that ChapelLocale is an internal module whose name doesn't matter to the user; numLocales suffices.

Yes. See ChapelLocale above. locale here is the type that represents locales in the language. You can query the number of cores on each locale, where locale values can be obtained by different means:

  1. here will always be the locale that you're currently executing on
  2. Locales is an array you can index from anywhere, e.g., Locales[2] will be the 3rd locale (arrays are 0-based) based on some enumeration of locales
  3. myVar.locale will be the locale where myVar resides.

There may be other means that I am missing. Maybe obvious, but the results of these can be stored in any variable, which would end up having locale type. e.g., var myLocale = here; from then on, myLocale.numCores is a valid expression.
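A minimal sketch pulling these together (the numCores query follows the usage in this thread; newer releases may prefer other queries such as maxTaskPar):

```chapel
// Three ways to obtain a locale value, per the list above:
const l1 = here;          // the locale the current task is executing on
const l2 = Locales[0];    // indexing into the built-in Locales array
var myVar = 42;
const l3 = myVar.locale;  // the locale where myVar resides

// Locale values can be stored and queried later:
var myLocale = here;
writeln(myLocale.numCores);  // core count, as discussed above
```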

Yes.

Yes. Putting a block of code in an on statement that targets a GPU locale is all that it takes for GPU-based execution/allocation to take place.
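For instance, a minimal sketch (assuming the current locale has at least one GPU; here.gpus is the array of GPU sublocales in recent releases):

```chapel
config const n = 1_000;

// Running this block on a GPU sublocale makes the array allocation
// happen in GPU memory and compiles the loop into a GPU kernel.
on here.gpus[0] {
  var A: [1..n] int;
  foreach i in 1..n do
    A[i] = i * i;
}
```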

Technically, DSI is a more advanced interface that a typical user doesn't interact with directly. It describes how one can create a custom distribution. Chapel releases contain several distributions, including but not limited to Block, Cyclic, and BlockCyclic. These distributions implement DSI so that they can be used as distributions on Chapel arrays.
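As a hedged sketch in 1.32-era syntax (newer releases spell the distribution blockDist and offer factory methods instead of dmapped):

```chapel
use BlockDist;

config const n = 8;

// Block-distribute an n x n index space across all locales; arrays
// declared over D have their elements partitioned the same way.
const Space = {1..n, 1..n};
const D = Space dmapped Block(boundingBox=Space);
var A: [D] real;

// Each iteration executes on the locale that owns its element.
forall a in A do
  a = here.id;
```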

Not following this one. Maybe silly, but coforall tid in 0..<nTasks do on Locales[tid%2] would put even-numbered tasks on Locales[0] and odd-numbered tasks on Locales[1]. So, things are probably more relaxed than you're thinking.
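Spelled out as a runnable sketch (generalizing tid % 2 to tid % numLocales so it also works with other locale counts):

```chapel
config const nTasks = 4;

// Launch nTasks tasks and place each one explicitly: with two locales,
// even tids land on Locales[0] and odd tids on Locales[1].
coforall tid in 0..<nTasks do
  on Locales[tid % numLocales] do
    writeln("task ", tid, " is running on locale ", here.id);
```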

Yes. As of 1.31, if you build the runtime with both strategies, you don't need to rebuild every time you change it; we don't overwrite the runtime builds based on this environment variable.

I don't know what framebuffer memory is on a GPU. I looked it up; it may be a software/hardware cache for displaying on a screen. In HPC, it is not a term we use. Chapel allocates data in the GPU's global memory. We currently have no user-facing means of allocating in texture or constant memory, though internally we use constant memory on AMD GPUs (in an admittedly haphazard way, but that's too much in the weeds).

Whether an allocation is zero-copy is a different story. By default, we allocate in unified memory, but we are working towards stopping doing that. In that mode, pages are moved between device and host automagically by the underlying implementation. I believe this mode also has zero-copy semantics, where using things like cudaMemcpy results in a single copy between the device and the host. The non-default CHPL_GPU_MEM_STRATEGY=array_on_device mode, OTOH, takes extra steps to make sure that host allocations are registered with the underlying implementation so that data copies involving the host don't result in an extra copy. E.g., if you're copying malloc'ed data into cudaMalloc'ed memory, the underlying implementation would need to do a copy on the host side to make sure that the data is in addressable pages on the host. The registration on the host side ensures that the data is already in addressable memory.

The short answer is: data movements between the host and the device are supposed to be zero-copy, or maybe more precisely, no-redundant-copy.

There is an (undocumented?) defaultStorageOrder compilation flag which you can set as -sdefaultStorageOrder=ArrayStorageOrder.CMO to make all Chapel arrays column-major. As of yet, there's no fine-grained control over this.

"Layout of structs" sounds like a different thing than AoS vs SoA. The latter should not be a language concept and the programmers can choose to use either using typical language means. For the former, I am not sure we give any guarantees at all. I am almost certain we don't add fields in a way that can put gaps between user's fields. And my strong guess is that a structs fields are always in the order as they are laid out by the user. Not sure if these are guaranteed by the specification or the current Chapel compiler, though.

Probably one of the ways, yes. Note also that it is a Package module, which typically receives less attention design and implementation-wise compared to Standard modules.

Definitely can be improved, but take a gander at DynamicIters – Chapel Documentation 1.33.
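For example, a small sketch of that module's dynamic() iterator, which implements an OpenMP-style dynamic schedule:

```chapel
use DynamicIters;

config const n = 100;

// dynamic() deals out chunks of the range to tasks on demand,
// so tasks that finish early simply grab more work.
forall i in dynamic(1..n, chunkSize=4) do
  writeln("iteration ", i, " executed by some task");
```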

That's a compiler internal that may be outdated. See Classes – Chapel Documentation 1.32 for class memory management. Array allocations are typically freed at the end of the lexical scope.

This sounds right.


Thank you so much, Engin! Your reply is really helpful!

Hi Anjiang / all β€”

Coming to this thread late, I wanted to add a few more details:

[quote="Anjiang, post:1, topic:24032"] Chapel supports multi-node. Each
node is called a locale, and the number of locales can be accessed by
ChapelLocale.numLocales [/quote]

Yes. ...

Though this is typically the case and a good mental model, we can be
slightly more precise or vague: On the vague side, a locale is just a
unit of the target architecture with processors and memory; on the more
precise side, it's a process in practice and the resources that that
process is bound to. Recent work has been adding a mode in which a locale
can be created per NIC or socket on a compute node β€” so slightly
finer-grained than the traditional locale-per-node model.

Also good to know about is locale.maxTaskPar, which will give the number
of tasks that the locale is capable of running concurrently. Typically,
this is the number of cores (particularly when running a locale per node),
but if the OS or another user setting has limited (or oversubscribed) the
number of threads, you'll get a different answer. When deciding how many
tasks to create, maxTaskPar is generally the preferred practice over
numCores().
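For example, a minimal sketch:

```chapel
// Create exactly as many tasks as this locale can run concurrently.
coforall tid in 0..<here.maxTaskPar do
  writeln("task ", tid, " of ", here.maxTaskPar, " on locale ", here.id);
```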

Technically, DSI is a more advanced interface that a typical user
doesn't interact with directly. It describes how one can create a custom
distribution. Chapel releases contain several distributions, including
but not limited to Block, Cyclic, and BlockCyclic. These distributions
implement DSI so that they can be used as distributions on Chapel
arrays.

Engin's answer is correct. Putting it a different way, I'd say that
Chapel lets users control how data is distributed using domain maps (or
"distributions" for short). The domain map standard interface is more
about authoring your own domain map than about how a typical user would
specify how an array is distributed.

Users in Chapel do not have a way to express memory layouts for arrays
(row-major vs. column-major ordering of axes mapped onto physical
memory) or layouts for structs (Array of Structs vs. Struct of Arrays)

There is an (undocumented?) defaultStorageOrder compilation flag which
you can set as -sdefaultStorageOrder=ArrayStorageOrder.CMO to make
all Chapel arrays column-major. As of yet, there's no fine-grained
control over this.

Adding to this, nothing in the language prevents users from creating array
layouts that are CMO, tiled, based on space-filling curves, etc. That said,
it requires writing your own domain map ("layout"), which is not a very
well-documented task. Providing finer-grained control over the CMO layout
that Engin mentions above would not be a particularly difficult task, but
it hasn't been one that any users have requested (that I know of), so it
has also not received any attention.

On the AoS vs. SoA question, at times we've discussed whether Chapel's
support for adding direct access methods and default iterators to records
was sufficient to make these kinds of choices without changes to the
"science" operating on the logical data structure, but that was years ago,
and I don't remember where it fell on the "rock solid" vs. "stunt" scale.
If you're aware of a language that has good support for changing from AoS
to SoA effortlessly, I'd be very interested in hearing about that, to
learn from it or see whether we could do the same thing.

Probably one of the ways, yes. Note also that it is a Package module,
which typically receives less attention design and implementation-wise
compared to Standard modules.

Some of our users have created their own, improved DistributedBag
recently, which I hope will make it into the packages directory at some
point: `DistBag_DFS`: our revisited version of `DistBag` for depth-first tree-search · Issue #21958 · chapel-lang/chapel · GitHub. This was also
covered in their CHIUW talk; see "Towards a Scalable Load Balancing for
Productivity-Aware Tree-Search" at Chapel: CHIUW 2023: 10th Annual Chapel Implementers and Users Workshop.

As Engin says, though, nothing about this is inherently part of Chapel;
it is simply an abstraction for load balancing built on top of Chapel's
language features.

Definitely can be improved, but take a gander at
DynamicIters – Chapel Documentation 1.33.

I would say "Chapel does not implement general load-balancing in the
language or its runtime directly (as Charm++ would, for example), but
supports the ability for users to create abstractions (collections,
iterators, etc.) that provide load-balancing capabilities. The
DistributedBag and DynamicIters cases are examples of such abstractions.

Basically, our philosophy has been that we'd prefer an imperative language
that gives you a reasonably firm foundation in terms of how your program
will execute, and that lets you build more complex policies (like load
balancing) on top of that foundation, rather than have the language and
runtime try to be smart but leave you no recourse when you want to control
something more precisely.

That's a compiler internal that may be outdated. See
Classes – Chapel Documentation 1.32
for class memory management. Array allocations are typically freed at
the end of the lexical scope.

My summary here would be: the lifetimes of all Chapel types (scalars,
records, arrays, etc.) are based on scoping, except for classes, where a
class object may outlive its scope and either be automatically freed (if
it is 'owned' or 'shared') or manually freed (if it is 'unmanaged').
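A small sketch of those lifetime rules (C is a made-up class):

```chapel
class C {
  var x: int;
}

{
  var o = new owned C(1);      // freed automatically at the end of this scope
  var s = new shared C(2);     // freed when the last shared reference goes away
  var u = new unmanaged C(3);  // must be freed by the programmer...
  delete u;                    // ...manually, with delete
  var r = (1, 2.0);            // scalars, tuples, records are scope-based too
}  // o and s are deallocated here
```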

Finally, I'll mention that my preferred Chapel citation is:

B. L. Chamberlain, "Chapel," in Programming Models for Parallel Computing,
P. Balaji, Ed. MIT Press, November 2015, ch. 6, pp. 129–159.

which is somewhat old at this point, but still the best published
reference for Chapel overall (where the website, current version of the
spec, release notes, etc. would be other more open-source artifacts to
cite).

Thanks for your interest in Chapel,
-Brad
