New Issue: Can you make our n-body benchmark faster?

aconsroe-hpe1 · February 11, 2022, 3:16pm

19237, "aconsroe-hpe", "Can you make our n-body benchmark faster?", "2022-02-11T15:14:14Z"

Can you make our n-body benchmark faster?

opened 03:14PM - 11 Feb 22 UTC

type: Feature Request area: Tests

Today, Chapel is about 3x slower than the top entry in the Computer Language Ben…chmarks Game's [nbody](https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/nbody.html) problem. Your mission, if you choose to accept it, is to try to speed it up. I suggest using [nbody.chpl](test/release/examples/benchmarks/shootout/nbody.chpl) as a starting point and from there, a few interesting directions to head in are: 1) Unroll the outer, inner, or both loops in `advance()` (and possibly `energy()`, though its only called twice) * Chapel supports unrolling with a `for param i in 0..<numBodies` * `numBodies` has to be made `param` to be used in a `for param` * Alt: some implementations precompute the list of `(i, j)` pairs since `numBodies` is known for this benchmark * Can you also make this `param`? And then `for param` over those? 2) Rewrite the "object centric" into a "data centric" implementation * Today the calculations are focused on the `record body` and stored in an array-of-structs fashion * Other implementations use a struct-of-arrays where all positions for example are stored contiguously * Does this make the code harder to follow (this will vary by person and familiarity)? Is the code shorter or longer? 3) Try to get some better vectorization * In some quick experiments, we see that operations on a `3*real` get vectorized, but does not utilize a full 256bit vector instruction (it does a 2x64 and 1x64; ideally we would get a 3x64 but the hardware needs to see a 4x64) * Can you expand to a `4*real`? Is there a nicer user interface to write `3*real` but really get `4*real` without having to insert 0's everywhere? * A very specific optimization other implementations do is a vectorized reciprocal square-root (`vrsqrt`) on 4x64. If you look closely though, they have to do so by first doing a `vrsqrt` on 4x32 because that is what the hardware supports. They then fixup the precision to get a better 64 bit result. * Can you write this in an extern block in Chapel? How clunky is it to use? * Without resorting to any extern blocks, can we get similar benefit by having the code look more obviously to the backend like `1 / sqrt(x:real(32))`? * Does this optimization depend on `-ffast-math`? (or `--no-ieee-float`) Some combination of 2 & 3 may yield compound improvements, because you could lay out all velocities as continguous `4*real` Our primary interest is increased performance but without impacting the cleanliness of the code. That being said, a version that was as fast as possible is also fair game.

Today, Chapel is about 3x slower than the top entry in the Computer Language Benchmarks Game's nbody problem.

Your mission, if you choose to accept it, is to try to speed it up.

I suggest using nbody.chpl as a starting point and from there, a few interesting directions to head in are:

Unroll the outer, inner, or both loops in advance() (and possibly energy(), though its only called twice)
- Chapel supports unrolling with a for param i in 0..<numBodies
- numBodies has to be made param to be used in a for param
- Alt: some implementations precompute the list of (i, j) pairs since numBodies is known for this benchmark
  - Can you also make this param? And then for param over those?
Rewrite the "object centric" into a "data centric" implementation
- Today the calculations are focused on the record body and stored in an array-of-structs fashion
- Other implementations use a struct-of-arrays where all positions for example are stored contiguously
- Does this make the code harder to follow (this will vary by person and familiarity)? Is the code shorter or longer?
Try to get some better vectorization
- In some quick experiments, we see that operations on a 3*real get vectorized, but does not utilize a full 256bit vector instruction (it does a 2x64 and 1x64; ideally we would get a 3x64 but the hardware needs to see a 4x64)
  - Can you expand to a 4*real? Is there a nicer user interface to write 3*real but really get 4*real without having to insert 0's everywhere?
- A very specific optimization other implementations do is a vectorized reciprocal square-root (vrsqrt) on 4x64. If you look closely though, they have to do so by first doing a vrsqrt on 4x32 because that is what the hardware supports. They then fixup the precision to get a better 64 bit result.
  - Can you write this in an extern block in Chapel? How clunky is it to use?
  - Without resorting to any extern blocks, can we get similar benefit by having the code look more obviously to the backend like 1 / sqrt(x:real(32))?
  - Does this optimization depend on -ffast-math? (or --no-ieee-float)

Some combination of 2 & 3 may yield compound improvements, because you could lay out all velocities as continguous 4*real

Our primary interest is increased performance but without impacting the cleanliness of the code. That being said, a version that was as fast as possible is also fair game.

Topic		Replies	Views
Announcing Chapel 1.28.0! Announcements	0	328	September 16, 2022
Chapel on an open-science Nvidia GPU + ARM system Users	6	296	September 20, 2021
Announcing Chapel version 1.19! Announcements	0	234	March 22, 2019
Announcing Chapel 1.27.0! Announcements	0	278	June 30, 2022
Announcing Chapel 1.22.0! Announcements	0	244	April 16, 2020

New Issue: Can you make our n-body benchmark faster?

Related Topics