[Chapel Merge] Enable the cache for remote data by default

Branch: refs/heads/master
Revision: 3b1971d
Author: ronawho
Log Message:

Merge pull request #17044 from ronawho/cache-remote-by-default

Enable the cache for remote data by default

[co-developed with @mppf]

The cache for remote data can provide significant speedups for
suboptimal communication patterns, but it has historically required
users to opt in with --cache-remote. It was off by default because it
introduced performance regressions for some communication patterns.
We have optimized those cases over the last year and now believe the
cache will very rarely hurt the performance of well-optimized
communication patterns, while it can have massive benefits for
unoptimized ones.

The cache supports read-ahead and write-behind, and can eliminate
repeated communication. As a toy example, consider the following code:

config const n = 1_000_000;
var A, B:[1..n] int;

on Locales[1] do
  for i in 1..n do
    B[i] = A[i];

// A[i]: normally an 8-byte GET; with cache read-ahead, done in 1024-byte chunks
// B[i]: normally an 8-byte PUT; with cache write-behind, done in 1024-byte chunks
// Array metadata: normally a repeated GET each iteration; only 1 GET with cache

Enabling the remote cache for this code results in a 20x speedup on a
Cray Aries network and a 100x speedup on FDR InfiniBand. Gains would be
larger yet on non-HPC networks like Ethernet. This isn’t a pattern most
users would write, but it’s pretty easy to introduce unintentional or
non-obvious communication in a PGAS language like Chapel. In some of our
own benchmarks we see:

  • ~3x speedup for MiniMD on Aries, ~100x on InfiniBand
  • ~20x speedup for PTRANS on Aries, ~500x on InfiniBand

These aren’t benchmarks we’ve tuned much, so they likely have poor
communication patterns, but they are codes that users could easily
write. This demonstrates that the cache can provide substantial speedups
for more complex but untuned or naive code.
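
To check the effect on a particular system, the toy loop above can be
wrapped in a timer. The following is a minimal sketch, not part of this
PR. It assumes a recent Chapel release where the Time module provides
`stopwatch` (older releases called it `Timer`), and it uses the last
locale instead of Locales[1] so that it also runs with a single locale.
The file name copy.chpl is hypothetical. Build it once with plain
`chpl copy.chpl` and once with `chpl --no-cache-remote copy.chpl`, then
run each binary with `-nl 2` and compare the reported times.

```chapel
use Time;

config const n = 1_000_000;
var A, B: [1..n] int;

// Fill A with known values so the copy can be verified afterwards.
forall i in A.domain do
  A[i] = i;

// Assumption: use the last locale so the program also runs with -nl 1
// (in that case there is no remote communication to cache).
const target = Locales[numLocales-1];

var t: stopwatch;
t.start();
on target do
  for i in 1..n do
    B[i] = A[i];   // remote GET of A[i], remote PUT to B[i] when target is not locale 0
t.stop();

// Every element of B should now match A.
const ok = && reduce (A == B);

writeln("copied ", n, " elements in ", t.elapsed(), " s; correct = ", ok);
```

The absolute numbers depend heavily on the network and the CHPL_COMM
setting, so only the ratio between the two builds is meaningful.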

Note that credit for the cache goes to @mppf, who originally developed
it. We’ve been collaborating on performance tuning, portability, and
correctness testing over the last year or so.

The cache implementation is careful to maintain Chapel’s Memory
Consistency Model (MCM), but bugs in the cache could lead to subtle
issues or races. To help alleviate concerns about that, this change was
tested more heavily than normal:

  • gasnet-udp: 365 trials for -multilocale-only
  • gasnet-ibv: 100 trials for release/examples runtime/configMatters
  • ugni: 100 trials for release/examples runtime/configMatters

For more info about the ideas behind the cache and recent work see:

For a more detailed look at code changes see:

Modified Files:
M compiler/main/driver.cpp
M man/chpl.rst

Compare: c551b581f0f3...3b1971d6e918 (chapel-lang/chapel)