Unrolling For Loops

Are there guidelines for unrolling for loops?

Consider the following vaguely accurate inner product:

proc innerNormal(u : [?uD] ?R, v : [?vD] R, si : R)
{
    inline proc twosum(a : R, b : R) // compute sum and its error
    {
        const s = a + b;
        const v = s - a;

        return (s, (a - (s - v)) + (b - v));
    }
    var s = si, e = 0:R, et : R;

    for (x, y) in zip(u, v) do // unroll this
    {
        (s, et) = twosum(s, x * y);

        e += et;
    }
    return s + e;
}

This is pretty slow. I deliberately want it serial because it will often be called by multiple parallel threads.

So, we want to unroll it. Let use a factor of 2 unrolling. Anything more than 4 performs badly.

proc innerUnrolled2(u : [?uD] ?R, v : [?vD] R, si : R)
{
    inline proc twosum(a : R, b : R)
    {
        const s = a + b;
        const v = s - a;

        return (s, (a - (s - v)) + (b - v));
    }
    param zero = 0:R, odd = 0x01;
    const (urows, ) = uD.dims(), iu = urows.low, nu = urows.high;
    const (vrows, ) = vD.dims(), iv = vrows.low, nv = vrows.high;
    const m = urows.size & odd;
    var (sp, sq) = if m == odd then (si, u[iu] * v[iv]) else (si, zero);
    var (ep, eq) = (zero, zero);

    for (i, j) in zip(iu + m .. nu by 2, iv + m .. nv by 2) do
    {
        const (_sp, _ep) = twosum(sp, u[i] * v[j]);
        const (_sq, _eq) = twosum(sq, u[i + 1] * v[j + 1]);

        sp = _sp; ep += _ep;
        sq = _sq; eq += _eq;
    }
    // wrap uppin
    {
        const (s, e) = twosum(sp, sq);

        return s + (e + (ep + eq));
    }
}

Does the use of zip'ering in the for loop have adverse overhead, i.e. is the optimal way to pass through the two indices associated with the 1D-arrays u and v? If not, what is?

If I have cases where I know that the indices associated with u and v are common, i.e. i == j,, should I have a separate routine that addresss that?

The majority of the latency (work-load) in the above will be memory access although the arithmetic involved will likely be of the same order of magnitude, just not quite as much. And with some luck, we might get super-scalar processing on some CPUs. This might even vectorize in the future.

Hi Damian —

This is a really interesting question. I've wanted an unroll directive of some sort for Chapel's for loops for some time, but haven't had a syntactic proposal I've been happy with to actually put an issue together (e.g., unroll 2 for i in 1..n seems clunky). We've recently been kicking around an annotations/tags-based feature for Chapel that could be used to express things to the compiler or tools that don't necessarily change the program's observed behavior (but could change its implementation), and I think unroll is a good candidate for this. For example, imagine:

@unroll 2
for i in 1..10 do...

What I find interesting about your example is the iteration being over arrays rather than ranges or domains, which is a case I don't think I've come across in cases where I've wanted to unroll, and I think your code points out that it's definitely more awkward. And more specifically, I can't come up with a way to keep the iteration over arrays while getting the unrolling by two factor. For example, my first thought was, "could we use for (x, y) in zip(u by 2, v by 2)... but of course that doesn't give you any way to get a handle on the in-between elements.

If you wanted to put together a feature request for there to be a clean way for users to request a loop to be unrolled (even if you don't have any better examples for syntax than what I've sketched above), I think that'd be helpful, both to get on the record, and to have a user (rather than me) requesting it. I think the "iterate over arrays" example is quite compelling, though I'll note that even for ranges and domains, it'd still be really nice to ask the compiler to take care of the clunkiness of unrolling, dealing with cleanup loops, etc. And think about how much easier it would be for a user to try values of 1, 2, 4, etc. by just substituting and integer rather than doing it themselves... This is the kind of thing compilers were made for.

In the meantime, for your specific example, my head first went to:

for (i, j) in zip(u.domain by 2, v.domain by 2) { ... }

in order to avoid manually picking the domains apart into their scalar low and high bounds (where I'm assuming this will only be applied to 1D arrays u and v, right?). However, a difference between this approach and yours is that we'd need a conditional within the loop if the size of u/v was odd to avoid an OOB indexing issue on the i+1 expressions, and presumably we don't want a conditional within the loop. OTOH, I'm noticing that you don't have a cleanup iteration if the loop has an odd size (and might have a similar OOB indexing issue if it's odd?), so maybe it's safe to assume it's even? If so, I'd be curious whether the loop above works for you.

If not, then my head goes to something like:

const n = u.size;
const evenSize = n - (n%2);
for (i, j) in zip(u.domain#evenSize by 2, v.domain#evenSize by 2) { ... }

but is this any better than what you have? It feels like about a wash in terms of clarity/code size, and I'd hope it'd be a wash in terms of performance. Mostly, this again makes it feel like something it'd be nice to have the compiler manage.

The inner body of your loop looks pretty good to me. Unless you'd rather represent the operators in their original order and avoid the cleanup using something like:

    for (i, j) in zip(iu + m .. nu by 2, iv + m .. nv by 2) do
    {
        {
             const (s, et) = twosum(s, u[i] * v[j]);
             e += et;
        }
        {
             const (s, et) = twosum(s, u[i+1] * v[j+1]);
             e += et;
        }
   }

But perhaps putting the operators back in this order defeats the benefits of unrolling that you were striving for?

-Brad

Thanks. Your explanations and ideas are insightful as usual.

I generally like to unroll my own. But I will try to put some examples together to see how the idea.

But I do like the idea behind

unroll 2 for i in 1..n do
{
   ....
}

but I never saw any performance gain over what I could do manually.

I am not a fan of directives.

My mistake. I think what I was evaluating (and getting no real improvement) was something like

unroll 2 for param i in 1..8 do
{
    ... stuff
}

I think that unrolling automatically which means handling the rump is too hard to get right, whereas doing it yourself should not be too hard. Indeed, if it is hard, then the unrolling you are doing is probably wrong.