New Issue: Can you make our n-body benchmark faster?

Issue #19237 · opened by aconsroe-hpe · 2022-02-11T15:14:14Z

Today, Chapel is about 3x slower than the top entry in the Computer Language Benchmarks Game's nbody problem.

Your mission, if you choose to accept it, is to try to speed it up.

I suggest using nbody.chpl as a starting point. From there, a few interesting directions to head in are:

  1. Unroll the outer, inner, or both loops in advance() (and possibly energy(), though it's only called twice)

    • Chapel supports unrolling with a for param i in 0..<numBodies (see the unrolling sketch after this list)
    • numBodies has to be made a param to be used in a for param loop
    • Alternatively, some implementations precompute the list of (i, j) pairs, since numBodies is known for this benchmark
      • Can you also make that list a param? And then for param over it?
  2. Rewrite the "object-centric" implementation as a "data-centric" one

    • Today the calculations are centered on the body record, and the data is stored in array-of-structs fashion
    • Other implementations use a struct-of-arrays layout where, for example, all positions are stored contiguously (see the struct-of-arrays sketch after this list)
    • Does this make the code harder to follow (this will vary by person and familiarity)? Is the code shorter or longer?
  3. Try to get some better vectorization

    • In some quick experiments, we see that operations on a 3*real get vectorized but don't use a full 256-bit vector instruction (the generated code does a 2x64 plus a 1x64; ideally we would get a 3x64, but the hardware needs to see a 4x64)
      • Can you expand to a 4*real? Is there a nicer user interface to write 3*real but really get 4*real without having to insert 0s everywhere?
    • A very specific optimization other implementations do is a vectorized reciprocal square root (vrsqrt) on 4x64. If you look closely, though, they have to do so by first doing a vrsqrt on 4x32, because that is what the hardware supports, and then fixing up the precision to get a better 64-bit result.
      • Can you write this in an extern block in Chapel? How clunky is it to use?
      • Without resorting to any extern blocks, can we get a similar benefit by making the code look more obviously like 1 / sqrt(x:real(32)) to the backend? (see the rsqrt sketch after this list)
      • Does this optimization depend on -ffast-math (or --no-ieee-float)?
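
Here is a minimal sketch of the param-unrolling idea from direction 1. The record and field names (body, pos, vel, mass) and the simplified advance() kernel are assumptions for illustration, not the exact code in nbody.chpl:

```chapel
// Illustrative sketch only: names and the simplified kernel are assumptions.
param numBodies = 5;            // must be a param to drive 'for param' loops

record body {
  var pos, vel: 3*real;
  var mass: real;
}

var bodies: [0..<numBodies] body;

proc advance(dt: real) {
  // 'for param' loops are fully unrolled at compile time
  for param i in 0..<numBodies {
    for param j in i+1..<numBodies {
      const d = bodies[i].pos - bodies[j].pos;
      const distSqr = d(0)**2 + d(1)**2 + d(2)**2;
      const mag = dt / (distSqr * sqrt(distSqr));
      bodies[i].vel -= d * (bodies[j].mass * mag);
      bodies[j].vel += d * (bodies[i].mass * mag);
    }
  }
  for param i in 0..<numBodies do
    bodies[i].pos += dt * bodies[i].vel;
}
```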
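
And a rough struct-of-arrays sketch for direction 2, with the same caveat that the names and the simplified kernel are illustrative; each quantity now lives contiguously in its own array across all bodies:

```chapel
// Illustrative struct-of-arrays layout: one contiguous array per quantity.
param numBodies = 5;

var posX, posY, posZ: [0..<numBodies] real;
var velX, velY, velZ: [0..<numBodies] real;
var mass: [0..<numBodies] real;

proc advance(dt: real) {
  for i in 0..<numBodies {
    for j in i+1..<numBodies {
      const dx = posX[i] - posX[j],
            dy = posY[i] - posY[j],
            dz = posZ[i] - posZ[j];
      const distSqr = dx*dx + dy*dy + dz*dz;
      const mag = dt / (distSqr * sqrt(distSqr));
      velX[i] -= dx * mass[j] * mag;  velX[j] += dx * mass[i] * mag;
      velY[i] -= dy * mass[j] * mag;  velY[j] += dy * mass[i] * mag;
      velZ[i] -= dz * mass[j] * mag;  velZ[j] += dz * mass[i] * mag;
    }
  }
  for i in 0..<numBodies {
    posX[i] += dt * velX[i];
    posY[i] += dt * velY[i];
    posZ[i] += dt * velZ[i];
  }
}
```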
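
For the last bullets of direction 3, one way to phrase the no-extern variant is a small helper that takes a 32-bit reciprocal-square-root estimate and refines it with a Newton-Raphson step. This is speculative: whether the backend actually maps the real(32) 1/sqrt onto a hardware rsqrt, and whether that requires --no-ieee-float or -ffast-math, would have to be checked:

```chapel
// Speculative sketch: 32-bit rsqrt estimate plus one Newton-Raphson
// refinement toward real(64) precision (a second step may be needed
// for full accuracy).
inline proc rsqrt(x: real): real {
  const est = (1.0: real(32)) / sqrt(x: real(32));  // low-precision estimate
  var y = est: real;
  // Newton-Raphson step for 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y)
  y = y * (1.5 - 0.5 * x * y * y);
  return y;
}

writeln(rsqrt(2.0));   // ~0.70710678
```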

Some combination of 2 and 3 may yield compound improvements, because you could lay out all velocities as contiguous 4*real values.
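
For example, a padded layout along these lines (names illustrative) keeps every body's position and velocity in one contiguous 4x64 lane:

```chapel
// Illustrative only: pad each position/velocity to 4*real (x, y, z, 0) so
// every body occupies a full 4x64 lane and the arrays stay contiguous.
param numBodies = 5;

var pos, vel: [0..<numBodies] 4*real;   // fourth component stays 0.0
var mass: [0..<numBodies] real;

// elementwise tuple ops apply to the padded tuples directly:
const d = pos[0] - pos[1];
const distSqr = d(0)*d(0) + d(1)*d(1) + d(2)*d(2);   // the pad lane is ignored
```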

Our primary interest is increased performance without impacting the cleanliness of the code. That said, a version that is as fast as possible is also fair game.