This needs significant discussion, which can come later. For now, I am only suggesting an interim solution.
By the way, Discourse says this post is similar (in context, I guess) to several others, none of which mention anything about Fused Multiply-Add (FMA). Yes, they relate to floating point, but that is a long bow to draw. If this belongs in another topic, or as a GitHub issue, feel free to move it (but please let me know, because when a post is moved, not even the original author is notified).
At the moment, an FMA can be achieved by calling the (C) maths library routine fma or fmaf, as in
inline proc _fma(x : ?Real, y : Real, z : Real) where Real == real(32)
{
  extern proc fmaf(x : real(32), y : real(32), z : real(32)) : real(32);
  return fmaf(x, y, z);
}
inline proc _fma(x : ?Real, y : Real, z : Real) where Real == real(64)
{
  extern proc fma(x : real(64), y : real(64), z : real(64)) : real(64);
  return fma(x, y, z);
}
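To make the point of those wrappers concrete, here is a minimal sketch of how they might be called, and of why a fused result can differ from a separate multiply and add. The input values are my own illustration, not from any real application. I have left out the printed values on purpose: whether the "separate" line really shows two roundings depends on the back-end not contracting it behind your back.

```chapel
// The two overloads from this post, repeated so the sketch compiles on its own.
inline proc _fma(x : ?Real, y : Real, z : Real) where Real == real(32)
{
  extern proc fmaf(x : real(32), y : real(32), z : real(32)) : real(32);
  return fmaf(x, y, z);
}
inline proc _fma(x : ?Real, y : Real, z : Real) where Real == real(64)
{
  extern proc fma(x : real(64), y : real(64), z : real(64)) : real(64);
  return fma(x, y, z);
}

// Illustrative values (an assumption for this sketch): eps = 2**-27, so the
// exact product x * y = 1 - 2**-54 lies halfway between two adjacent
// real(64) values and cannot be stored without rounding.
const eps = 1.0 / 134217728.0;
const x = 1.0 + eps, y = 1.0 - eps, z = -1.0;

// Product rounded first, then the sum rounded again: two roundings.
const prod = x * y;
writeln("separate: ", prod + z);

// A single rounding of the exact x*y + z, which here is -2**-54.
writeln("fused:    ", _fma(x, y, z));

// The real(32) overload is selected the same way, by argument type.
const xs = 1.5 : real(32), ys = 2.0 : real(32), zs = 0.25 : real(32);
writeln("real(32): ", _fma(xs, ys, zs));
```

With a correctly rounded fma, the fused result is a tiny negative number, whereas the separate form loses that information when the product is rounded to 1.0.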
Directly calling those routines is slow.
Until there is a discussion about how Chapel should support FMA in its arithmetic without calling those routines, addressing the performance issues in the above is moot.
In the meantime, can we get a compiler flag that forces the back-end to do what -mfma achieves in many C compilers, mapping any explicit call to what appears to be the maths library routine fma (or fmaf) to a macro that directly produces a fused multiply-add machine instruction? This may need to be totally separate from --fast, because it has to work when --ieee-float is invoked with --fast.
By way of giving people something to exercise the little grey cells in some quiet moments of reflection, one might consider the following which comes from the way Rust approaches this topic.
Code such as
(x * y)
should ALWAYS be done as a conventional multiplication, with the result rounded to the precision of x if its precision matches or exceeds that of y, or to that of y otherwise.
Similarly, code such as
(x * y + z)
or even
z += x * y;
is always evaluated (using the above proc definitions) as the respective equivalents of
_fma(x, y, z)
or
z = _fma(x, y, z);
On the other hand, how code like the following would be interpreted is anybody's guess
(x * y + p * q)
That said, more precise code like
(x * y + (p * q))
would of course follow those rules just mentioned.
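To show why the unparenthesised form matters, here is a rough sketch of the candidate evaluations of x * y + p * q, written out by hand. The wrapper is a real(64)-only version in the style of the _fma above, and the values are chosen by me purely to make the difference visible; nothing here reflects what any compiler actually does today.

```chapel
// real(64)-only wrapper in the style of the _fma procs in this post.
inline proc _fma(x : real(64), y : real(64), z : real(64)) : real(64)
{
  extern proc fma(x : real(64), y : real(64), z : real(64)) : real(64);
  return fma(x, y, z);
}

// Illustrative values (an assumption for this sketch).
const eps = 1.0 / 134217728.0;       // 2**-27
const x = 1.0 + eps, y = 1.0 - eps;  // exact x*y = 1 - 2**-54, stored as 1.0
const p = x, q = -y;                 // exact p*q = -(x*y), stored as -1.0

const xy = x * y;                    // rounded product
const pq = p * q;                    // rounded product

// Candidate 1: fuse x*y into the sum; p*q is rounded first.
writeln("fuse x*y:  ", _fma(x, y, pq));   // exact result is -2**-54

// Candidate 2: fuse p*q into the sum; x*y is rounded first.
writeln("fuse p*q:  ", _fma(p, q, xy));   // exact result is +2**-54

// Candidate 3: no fusing at all; both products rounded before the add.
writeln("no fusing: ", xy + pq);
```

Mathematically the expression is zero, yet the two fused candidates give answers of opposite sign, which is why the grouping has to be pinned down by a rule or by parentheses.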
And of course, the compiler can be forced to not emit FMA instructions.
The applications are manifold.
But for now, only the compiler is remotely mission critical. The rest needs feedback and discussion.