Consider the following code snippet (all the vectors are actually of type real(w)).
inline proc dot(f, g, p, q) // p and q are scalar, the rest are vectors
{
// this algebra should be written using 3 FMAs to minimize errors
// by effectively doing the calculations at additional precision!
return f * p + g * q;
}
inline proc visc(r, m, l, t, s, p, q) // same comment as for dot above
{
return dot((r - m) + (l - m), (t - m) + (s - m), p, q);
}
inline proc conv(f, g, c, m, l, t, s, p, q) // same comment as for dot above
{
return dot((t - s) * ((t + s) + (m + m)), c - (l + m) * (f + g), p, q);
}
.......
// whole lot of declarations
....
forall i in 2..Nx-1 do
{
const jm1 = 1..Ny-2, j = 2..Ny-1, jp1 = 3..Ny, jp2 = 4..Ny+1, ip1 = i+1;
const ref uni = un[i, ..], unip1 = un[ip1, ..];
const ref vni = vn[i, ..], vnip1 = vn[ip1, ..];
const ref ur = unip1[jp1], um = unip1[j], uij = uni[j];
const ref vr = vnip1[jp1], vm = vni[jp1], vij = vni[j];
const uvrm = (um + ur) * (vm + vr);
// in X
{
const ref ul = unip1[jm1], vip1j = vnip1[j], ui2j = un[i+2, j];
ref vx = VISCx[ip1, j], cx = CONVx[ip1, j];
vx = visc(ur, um, ul, ui2j, uij, rdy2, rdx2);
cx = conv(vip1j, vij, uvrm, um, ul, ui2j, uij, rqdx, rqdy);
ustar[ip1, j] = um + dt * (damp * vx - cx);
}
// in Y
{
const ref vl = vn[i-1, jp1], uijp1 = uni[jp1], vij2 = vni[jp2];
ref vy = VISCy[i, jp1], cy = CONVy[i, jp1];
vy = visc(vr, vm, vl, vij2, vij, rdx2, rdy2);
cy = conv(uijp1, uij, uvrm, vm, vl, vij2, vij, rqdy, rqdx);
vstar[i, jp1] = vm + dt * (damp * vy - cy);
}
}
Within that inner loop, there are about 41 vector operations, all of length Ny-2. They can be written in scalar form to live within a single for or foreach loop which would have a single branch.
But, in the form above, where the algebra is written in terms of vector operations, how many branches will be created? Is it still just one?
Thanks.