Last Non-Zero Column in a 2D-array

Is there an elegant way to find the last Non-Zero Column in a reasonably dense 2D array which does not causes cache misses? The idea is that given a matrix operation, say M*N, if there is a chance that the
last few rows or columns might be all zero, removing them from that operation saves time. For an dense matrix which has few (or no) such trailing rows, searching will very quickly tell you so. For one which has
such zero columns, the overhead is (generally) worth it. Well, the LAPACK people say so in some places, and choose to ignore ignore it in other places. And I am doing benchmarks against what LAPACK does. So I need to be comparing applies with apples.

I can find the last Non-Zero Row with code has no cache-miss issues. But then accessing, by-rows, a 2D Chapel array gives me contiguous memory access so that was easy.

To find the last Non-Zero column, the best code I can do is the following. A reduce guru can probably do all this in one line. But that guru is certainly not me. Any help?

proc lastNonZeroEntry(t : [?tD] ?R, lo, hi) // t is a 1D-array
{
    param zero = 0:R;
    var j = hi;

    while j > lo && t[j] == zero do
    {
        j -= 1;
    }
    return j;
}

proc lastNonZeroColumn(a : [?aD] ?R) // 
{
    param zero = 0:R;
    const (rows, columns) = aD.dims();
    const n = columns.high;
    var c = columns.low;

    for i in rows do // step through the 2D array to minimize cache misses
    {
        ref ai = a[i, ..]; // access the whole row (contiguous in memory)

        if ai[n] != zero then // treat the simplest case directly
        {
            c = n;
            break;
        }
        else // update the maximum column with at least one non-zero entry
        {
            c = lastNonZeroEntry(ai, c, n - 1);
        }
    }
    return c;
}

And yes, I probably should precompute n-1. But then the code is not as clear.

And yes, there are some smarts you can put in there to exit more rapidly if, at some stage during this search process, the performance gain possible from trimming the matrix size is deemed to not be worth the effort. For example, if say the gain is only 25%. But let's leave that for now.