19237, "aconsroe-hpe", "Can you make our n-body benchmark faster?", "2022-02-11T15:14:14Z"
Today, Chapel is about 3x slower than the top entry in the Computer Language Benchmarks Game's nbody problem.
Your mission, if you choose to accept it, is to try to speed it up.
I suggest using nbody.chpl as a starting point and from there, a few interesting directions to head in are:
-
Unroll the outer, inner, or both loops in
advance()
(and possiblyenergy()
, though its only called twice)- Chapel supports unrolling with a
for param i in 0..<numBodies
-
numBodies
has to be madeparam
to be used in afor param
- Alt: some implementations precompute the list of
(i, j)
pairs sincenumBodies
is known for this benchmark- Can you also make this
param
? And thenfor param
over those?
- Can you also make this
- Chapel supports unrolling with a
-
Rewrite the "object centric" into a "data centric" implementation
- Today the calculations are focused on the
record body
and stored in an array-of-structs fashion - Other implementations use a struct-of-arrays where all positions for example are stored contiguously
- Does this make the code harder to follow (this will vary by person and familiarity)? Is the code shorter or longer?
- Today the calculations are focused on the
-
Try to get some better vectorization
- In some quick experiments, we see that operations on a
3*real
get vectorized, but does not utilize a full 256bit vector instruction (it does a 2x64 and 1x64; ideally we would get a 3x64 but the hardware needs to see a 4x64)- Can you expand to a
4*real
? Is there a nicer user interface to write3*real
but really get4*real
without having to insert 0's everywhere?
- Can you expand to a
- A very specific optimization other implementations do is a vectorized reciprocal square-root (
vrsqrt
) on 4x64. If you look closely though, they have to do so by first doing avrsqrt
on 4x32 because that is what the hardware supports. They then fixup the precision to get a better 64 bit result.- Can you write this in an extern block in Chapel? How clunky is it to use?
- Without resorting to any extern blocks, can we get similar benefit by having the code look more obviously to the backend like
1 / sqrt(x:real(32))
? - Does this optimization depend on
-ffast-math
? (or--no-ieee-float
)
- In some quick experiments, we see that operations on a
Some combination of 2 & 3 may yield compound improvements, because you could lay out all velocities as continguous 4*real
Our primary interest is increased performance but without impacting the cleanliness of the code. That being said, a version that was as fast as possible is also fair game.