I re-implemented this in Chapel. There were 8 transmutes in the routine.
Either way, compared to the library version which Chapel uses via AutoMath:
Version A in pure Chapel is 2.56 times faster
- several branches, no table lookups, and multiplications minimized
Version B in pure Chapel is 2.70 times faster
- nearly branchless programming, at the cost of table lookups and extra multiplications
Version A will benefit somewhat once likely/unlikely works, although for really huge and really tiny complex numbers it will slow down.
Version B will benefit somewhat once an array can be put in program static data. It is independent of the size of the numbers.
I know that the performance gain is not all mine.
A lot of that gain must be down to the Chapel compiler itself or at the very least, me being able to express my algorithm optimally in Chapel. Congratulations Chapel.
A lot of the complex(w) elementary routines, at least in the underlying C code, make use of other elementary functions. That means that you often have duplicate argument/parameter checking for edge values which need special handling, i.e. when either the real or imaginary component is a zero, an infinity or even a Not-a-Number (NaN). Also, there is sometimes wrap-up code in both the internally called routine and the user-called routine. More duplication. My approach avoids those issues. There were some other improvements, not least being an anal, i.e. beyond rigorous, approach to the sums of squares that occur inside these routines. So I end up with more lines of code because I am manually expanding some internal function calls within the primary routine. But as you can see from the performance, my changes do not incur any penalty.
Also, the code design diverges from existing practice to favour readability over compactness (which in turn makes it easier to verify the code's correctness, especially against Annex G of the C standard, the definitive reference for what should be done with extreme values). So we are trying to evaluate the effectiveness of this approach to this part of the overall algorithm, which is crucial to get correct but which most users will never encounter unless they have lots of NaNs, infinities or zeros in their data. We have yet to bounce it off some others in the wider numerical analysis community (many of whom have heard of Chapel but pretty well most of whom have never read it). All I can say is that I would hate to be trying to do this in Julia.
None of this work would be possible without the transmute() feature of Chapel.
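To give a flavour of why (a minimal sketch, not a line from the actual routine, and assuming the method form x.transmute(type)): transmute() reinterprets the bits of a value as another type of the same size, so the IEEE 754 exponent field can be classified with integer operations instead of floating-point comparisons.

const x = 1.5e300;
const bits = x.transmute(uint(64));      // bit-for-bit reinterpretation, no conversion
const biasedExp = (bits >> 52) & 0x7ff;  // extract the IEEE 754 exponent field
writeln("biased exponent of ", x, " is ", biasedExp);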
This pure Chapel implementation means that I do have access to fma(), and I exploit it across the board. That has been in IEEE 754 since 2008, so I do not think I am being extravagant in demanding to be able to use this feature of the underlying hardware. And Chapel always expands sqrt() inline, which also helps this pure Chapel implementation. Thanks for both of those, Michael.
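As one illustration of what fma() buys (a sketch of the classic compensated trick, not code lifted from the routine): the rounding error of a product can be recovered exactly with a single fma, which tightens a sum of squares considerably.

use Math;  // fma() currently lives in Math

proc sumOfSquares(x : real, y : real) : real
{
  const yy  = y * y;
  const err = fma(y, y, -yy);   // the low-order part of y*y, exactly
  return fma(x, x, yy) + err;   // x*x + y*y with reduced roundoff
}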
It is meant as an almost drop-in replacement. Because it uses fma(), it has to live in Math.
// Complex Square Root
// - avoids calling hypot altogether for performance
// -- avoids duplication of Special Case Handling in 'hypot'
// -- handles the case of a largely real(w/2) or imag(w/2) argument directly
// -- replaces core internals with a variant of Blue's algorithm
// -- uses a simpler technique to avoid overflow scaling
// -- avoids the slightly messier technique of MUSL + GLIBC
// - avoids the use of copysign altogether
// - cleaner/clearer albeit more verbose Special Case handling
// - consider the use of 'rsqrt' for performance at some stage
// Paul Friedland, Algorithm 312: Absolute Value and Square Root of
// a Complex Number, Oct 1967, Communications of the ACM, 10(10):665
// The roundoff error from the addition |a|+sqrt(a^2+b^2) overwhelms
// the roundoff error in the computation of the square root term. So
// the algorithm largely minimizes its own error.
proc sqrt(z : complex(?w))
{
  ...
}
and you would have to remove the equivalent from AutoMath. Or some other appropriate permutation.
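For readers without the branch handy, the untuned core of Algorithm 312 looks roughly like the following. This is a textbook rendering, not the actual code: it has no scaling, no overflow protection in x*x + y*y, and none of the special-case handling the header above promises (it even divides by zero when z is zero).

proc naiveCsqrt(z : complex(?w)) : complex(w)
{
  const x = z.re, y = z.im;
  const r = sqrt(x * x + y * y);     // |z|, naively
  const t = sqrt((abs(x) + r) / 2);  // Friedland's key quantity
  var s : complex(w);
  if x >= 0
  {
    s.re = t;
    s.im = y / (2 * t);
  }
  else
  {
    s.re = abs(y) / (2 * t);
    s.im = if y < 0 then -t else t;
  }
  return s;
}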
For completeness, the error is the same in both the new implementation and that of Math. It is approximately 2.2 ULP(|z.re|) in the real part and 1.0 ULP(|z.im|) in the imaginary part, from testing with real(64) and real(32). With some final tweaks, the speed-up is just over 3.0 for complex(64), and slightly less at 2.9 for complex(128).
First of all, that's really exciting. Nicely done, Damian!
Second, I don't think it'll have to live in Math just because of its
reliance on fma - we can put the use/import of the Math module inside
the body of the function itself so that the module is only brought in
when sqrt is used. Square root is a really important operation, and I
think it would be ideal to avoid the potential confusion of "why is it
defined in two places", by having this new version replace the old
implementation.
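For instance (a minimal sketch of that pattern, with a made-up proc name):

proc scaledFma(a : real, b : real, c : real) : real {
  use Math only fma;  // Math is brought in only when this proc is actually used
  return fma(a, b, c);
}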
We could also consider evaluating whether fma should be promoted to an
automatically available feature - iirc, we added it to Math out of
caution, due to it being relatively new and wanting more experience with
it, and it's still unstable. So part of stabilizing fma could potentially
involve making it automatically available.
Still more testing to be done, so ditching the old is a way off as yet. Also, there is scaling at the core of this routine. Once GitHub Issue 25088 is finished, and it seems close, then I have to cover real(16) arguments with their underlying small exponent range. Any real(16) function, or in this case a complex(32) routine, will often have problems with scaling formulas, so we might have to qualify the generic nature of this routine.
Also, the routine would benefit from GitHub Issue 27136, which is still only a twinkle in Brad's eye (because it took me so, so long to clearly explain the underlying issue).
As a general comment, my experience so far is that most of the elementary mathematical functions would be faster if written in pure Chapel, because Chapel makes it easier to write better algorithms, or to write an existing algorithm better. But that is a phenomenally huge task across the spectrum of the elementary mathematical functions. I happen to experiment with these routines to highlight issues we see elsewhere in other code, including finite element (FE) algorithms. Not only are they not further complicated by any parallel considerations, but I figure people will more easily understand (or find references to) that mathematics than the sort of weirder stuff we might normally program.
There is an unexplained 2**(-29) relative error in the real(32) variant, i.e. 9% of the machine tolerance, which does not exist in the real(64) variant. I need to get to the bottom of that before I am happy with my generic version. I know how to avoid it (which is what everybody else does) but I do not know why it happens. I want to get as close to complete as possible with testing using real(32) code because there are only a smallish number of such IEEE 754 numbers in the universe. I then rely on the generic nature of the code to simplify and reduce the level of detail in my final real(64) testing.
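For concreteness, the kind of exhaustive real(32) sweep this relies on might look like the following hypothetical harness. It checks the real sqrt against a real(64) reference rather than the complex routine, and the names are mine:

var mismatches = 0;
forall i in 0 .. 0xffffffff with (+ reduce mismatches) {
  const x = (i : uint(32)).transmute(real(32));  // every bit pattern is a real(32)
  if x >= 0 && x <= max(real(32)) {              // skip NaNs, negatives and +inf
    const got  = sqrt(x);
    const want = sqrt(x : real(64)) : real(32);  // real(64) as the reference
    if got != want then mismatches += 1;
  }
}
writeln(mismatches, " real(32) inputs disagree with the reference");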
Can somebody tell me what the 2nd routine is doing? This is an excerpt from AutoMath. Is there a low-level primitive that does a complex square root without calling the C routine csqrt()?
/* Returns the square root of the argument `x`. */
inline proc sqrt(x: complex(128)): complex(128) {
  pragma "fn synchronization free"
  pragma "codegen for CPU and GPU"
  extern proc csqrt(x: complex(128)): complex(128);
  return csqrt(x);
}
/* Returns the square root of the argument `x`. */
proc sqrt(param x: complex(128)) param : complex(128) {
  return __primitive("sqrt", x);
}
The 2nd overload is doing a param computation, so it needs to be implemented within the compiler. It wouldn't be possible to implement it with an extern proc, since we don't assume those are callable at compile time. The "sqrt" primitive is what tells the compiler to use its internal code to compute the square root of x at compile time.
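A hypothetical illustration of the difference (the names are mine):

param z = 2.0 + 0.0i;  // a param complex(128)
param r = sqrt(z);     // folded at compile time via the "sqrt" primitive,
                       // so no call to csqrt() appears in the generated code
writeln(r);            // merely prints the precomputed constant at run time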
Chapel code shouldn't run on a GPU unless it explicitly or implicitly contains an on clause directing it to run on one. Though I haven't seen your code, I'm guessing it does not, so I would imagine it would strictly run on a CPU by default.
But maybe what you're asking is: As the author of the code, is there something you would need to do if a user were to invoke it within a GPU on-clause? Say they wrote:
on here.gpus[0] {
  forall c in myArrayOfComplex do
    c = sqrt(c);
}
I believe nothing should inherently "go wrong" here. Either the body of your function will be capable of being compiled for the GPU, so it will run there; or it will not be capable of that, so it will run on the CPU (that is, just because Chapel code is within a GPU on-clause does not mean it will necessarily run on the GPU if it cannot). If a behavior other than one of those two happened, that would arguably be a bug on our part.
I don't think we've discussed the possibility of having a way to flag code as "I should never run on a GPU even if I can be." We could, but I don't see a clear use case for it at present.
I noted the mention of GPU in the pragma of the complex square root definition in AutoMath and was worried my own code needed to do something extra. Thanks for putting my fears to rest, Brad.
I'm admittedly not familiar with the role of that pragma, though my impression is that it says "even though this is an extern proc, it will be OK to run on the GPU." Tagging @e-kayrakli and @stonea, who can probably validate that guess quickly.
Well, I don't know about "quickly", but I have an answer. What Brad said is correct. More details:
The compiler doesn't quite "see" extern procs. As the user, you are assuring the compiler that the function you are declaring at that point will exist at link time. So the compiler will just work with that assumption, trusting you.
When GPUs are involved, if you have an extern proc that you want to execute on the GPU, the actual definition of that function must have __device__ in it. But then again, the compiler has no way of knowing that (the linker does). So the compiler acts conservatively and assumes that extern procs are not GPU-eligible by default. If you have an otherwise GPU-eligible loop that calls one such extern proc, that loop will be deemed GPU-ineligible because of that call. But what if that function indeed had __device__, and you know what you are doing and want that function to be called from the device? Then you add that pragma. In effect, it prevents GPU offload from being thwarted when the compiler sees extern calls.
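So, as a hypothetical sketch of what that looks like (myDeviceFn and its C-side __device__ definition are assumed, not real):

config const n = 1_000_000;

// assumed to be defined elsewhere and compiled for the device with __device__
pragma "codegen for CPU and GPU"
extern proc myDeviceFn(x : real(64)) : real(64);

on here.gpus[0] {
  var A : [1..n] real;
  forall a in A do
    a = myDeviceFn(a);  // without the pragma, this call would make the loop GPU-ineligible
}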
@damianmoz -- if you don't have access to GPUs, I'd be happy to give your branch an early test to see if it acts as you'd expect with a GPU.