Branch: refs/heads/master
Revision: 3b1971d
Author: ronawho
Log Message:
Merge pull request #17044 from ronawho/cache-remote-by-default
Enable the cache for remote data by default
[co-developed with @mppf]
The cache for remote data can provide significant speedups for
suboptimal communication patterns, but it has historically required
users to opt in with --cache-remote. It was off by default because it
introduced performance regressions for some communication patterns, but
we have optimized those cases over the last year and believe the cache
will very rarely hurt the performance of well-optimized communication
patterns while providing massive benefits for unoptimized ones.
The cache supports read-ahead and write-behind, and it can eliminate repeated
communication. As a toy example, consider the following code:
```chapel
config const n = 1_000_000;
var A, B: [1..n] int;

on Locales[1] do
  for i in 1..n do
    B[i] = A[i];

// A[i]: normally an 8-byte GET per iteration; done in 1024-byte chunks with cache read-ahead
// B[i]: normally an 8-byte PUT per iteration; done in 1024-byte chunks with cache write-behind
// Array metadata: normally re-fetched with a GET every iteration; only 1 GET with the cache
```
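To make the chunking arithmetic in those comments concrete, here is a toy model (hypothetical, not the runtime's actual cache code) that counts GETs for a sequential read of `A`: one 8-byte GET per element without the cache, versus one GET per 1024-byte line when lines are fetched whole and then reused. The names `countGets`, `eltBytes`, and `lineBytes` are made up for illustration. The model ignores timing; real read-ahead issues these line fetches early so they overlap with the loop, and write-behind batches the PUTs to `B` in the same chunked way.

```chapel
// Hypothetical toy model, NOT the Chapel runtime's cache: count the remote
// GETs needed to read numElts 8-byte ints sequentially.
config const n = 1_000_000;
param eltBytes = 8;      // bytes per int, matching the "8-byte GET" comment
param lineBytes = 1024;  // chunk size, matching the "1024-byte chunks" comment

proc countGets(numElts: int, useCache: bool): int {
  // Without the cache, every element access is its own 8-byte GET.
  if !useCache then return numElts;

  // With the cache, a miss fetches a whole 1024-byte line, and later
  // accesses to the same line are served locally.
  var gets = 0;
  var cachedLine = -1;   // line currently held in the (single-line) cache
  for i in 0..numElts-1 {
    const line = (i * eltBytes) / lineBytes;
    if line != cachedLine {
      gets += 1;
      cachedLine = line;
    }
  }
  return gets;
}

writeln("without cache: ", countGets(n, false), " GETs");
writeln("with cache:    ", countGets(n, true), " GETs");
```

For n = 1,000,000 this reports 1000000 GETs without the cache and 7813 with it (8,000,000 bytes / 1024 bytes per line, rounded up), roughly 128x fewer messages.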
Using --cache-remote on this code results in a 20x speedup on a Cray Aries
network and a 100x speedup on FDR InfiniBand. Gains would be even larger
on non-HPC networks such as Ethernet. This isn’t a pattern most users would
write, but it’s easy to introduce unintentional or non-obvious
communication in a PGAS language like Chapel. In some of our own
benchmarks we see:
- ~3x speedup for MiniMD on Aries, ~100x on InfiniBand
- ~20x speedup for PTRANS on Aries, ~500x on InfiniBand
These aren’t benchmarks we’ve tuned much, so they likely have poor
communication patterns, but they are codes that users could easily
write. This demonstrates that the cache can provide substantial speedups
for more complex, untuned, or naive code.
Credit for the cache goes to @mppf, who originally developed it;
we’ve been collaborating on performance tuning, portability, and
correctness testing over the last year or so.
The cache implementation is careful to maintain Chapel’s Memory
Consistency Model (MCM), but bugs in the cache could lead to subtle
issues or races. To help alleviate concerns about that, this change has
been tested more heavily than usual:
- gasnet-udp: 365 trials for -multilocale-only
- gasnet-ibv: 100 trials for release/examples runtime/configMatters
- ugni: 100 trials for release/examples runtime/configMatters
For more information about the ideas behind the cache and recent work, see:
- https://chapel-lang.org/papers/pgas15-caching.pdf (detailed paper)
- https://chapel-lang.org/CHIUW/2014/ferguson-caching-chiuw.pdf (original talk)
- https://chapel-lang.org/releaseNotes/1.21/04-perf-opt.pdf slides 28-32 (recent tuning/fixes)
- https://chapel-lang.org/releaseNotes/1.23/05-ongoing.pdf slides 33-35 (recent tuning)
For a more detailed look at the code changes, see:
- Cache for remote data implementation and tests (chapel-lang/chapel@d78ea51)
- Completely remove non-blocking GET support from ugni by ronawho (PR #14329)
- Improve performance for large transfers under --cache-remote by ronawho (PR #14352)
- Fix support for unorderedCopy with --cache-remote by ronawho (PR #14355)
- Tune the transfer size for bypassing --cache-remote by ronawho (PR #14619)
- Disable ugni comm/compute overlap with --cache-remote by ronawho (PR #15266)
- Remove workaround for cache-remote and ugni by mppf (PR #16020)
- Improve --cache-remote for RA by mppf (PR #16379)
- Adjust remote cache to better handle task yields in comm events by mppf (PR #16885)
Modified Files:
M compiler/main/driver.cpp
M man/chpl.rst
Compare: c551b581f0f3...3b1971d6e918