Are there guidelines for unrolling for loops?
Consider the following vaguely accurate inner product:
proc innerNormal(u : [?uD] ?R, v : [?vD] R, si : R)
{
inline proc twosum(a : R, b : R) // compute sum and its error
{
const s = a + b;
const v = s - a;
return (s, (a - (s - v)) + (b - v));
}
var s = si, e = 0:R, et : R;
for (x, y) in zip(u, v) do // unroll this
{
(s, et) = twosum(s, x * y);
e += et;
}
return s + e;
}
This is pretty slow. I deliberately want it serial because it will often be called by multiple parallel threads.
So, we want to unroll it. Let use a factor of 2 unrolling. Anything more than 4 performs badly.
proc innerUnrolled2(u : [?uD] ?R, v : [?vD] R, si : R)
{
inline proc twosum(a : R, b : R)
{
const s = a + b;
const v = s - a;
return (s, (a - (s - v)) + (b - v));
}
param zero = 0:R, odd = 0x01;
const (urows, ) = uD.dims(), iu = urows.low, nu = urows.high;
const (vrows, ) = vD.dims(), iv = vrows.low, nv = vrows.high;
const m = urows.size & odd;
var (sp, sq) = if m == odd then (si, u[iu] * v[iv]) else (si, zero);
var (ep, eq) = (zero, zero);
for (i, j) in zip(iu + m .. nu by 2, iv + m .. nv by 2) do
{
const (_sp, _ep) = twosum(sp, u[i] * v[j]);
const (_sq, _eq) = twosum(sq, u[i + 1] * v[j + 1]);
sp = _sp; ep += _ep;
sq = _sq; eq += _eq;
}
// wrap uppin
{
const (s, e) = twosum(sp, sq);
return s + (e + (ep + eq));
}
}
Does the use of zip'ering in the for loop have adverse overhead, i.e. is the optimal way to pass through the two indices associated with the 1D-arrays u and v? If not, what is?
If I have cases where I know that the indices associated with u and v are common, i.e. i == j,, should I have a separate routine that addresss that?
The majority of the latency (work-load) in the above will be memory access although the arithmetic involved will likely be of the same order of magnitude, just not quite as much. And with some luck, we might get super-scalar processing on some CPUs. This might even vectorize in the future.