Last Non-Zero Column in a 2D-array

Is there an elegant way to find the last non-zero column in a reasonably dense 2D array which does not cause cache misses? The idea is that, given a matrix operation, say M*N, if there is a chance that the last few rows or columns might be all zero, removing them from that operation saves time. For a dense matrix which has few (or no) such trailing rows, searching will very quickly tell you so. For one which has such zero columns, the overhead is (generally) worth it. Well, the LAPACK people say so in some places, and choose to ignore it in other places. And I am doing benchmarks against what LAPACK does, so I need to be comparing apples with apples.

I can find the last non-zero row with code that has no cache-miss issues. Accessing a 2D Chapel array by rows gives me contiguous memory access, so that was easy.
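
For reference, a minimal sketch of that row case (the name lastNonZeroRow and its body are mine, for illustration only): scan the rows bottom-up, and within each row walk the contiguous memory, returning at the first non-zero found.

proc lastNonZeroRow(a : [?aD] ?R) // a is a 2D-array
{
    param zero = 0:R;
    const (rows, columns) = aD.dims();

    for i in rows by -1 do // bottom-up: the first hit is the answer
    {
        for j in columns do // contiguous memory within a row
        {
            if a[i, j] != zero then
                return i;
        }
    }
    return rows.low;
}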

To find the last non-zero column, the best code I can come up with is the following. A reduce guru can probably do all this in one line, but that guru is certainly not me. Any help?

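// walks backwards from hi; returns the largest j with t[j] != 0, or lo if t[lo+1..hi] is all zero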
proc lastNonZeroEntry(t : [?tD] ?R, lo, hi) // t is a 1D-array
{
    param zero = 0:R;
    var j = hi;

    while j > lo && t[j] == zero do
    {
        j -= 1;
    }
    return j;
}

proc lastNonZeroColumn(a : [?aD] ?R) // a is a 2D-array
{
    param zero = 0:R;
    const (rows, columns) = aD.dims();
    const n = columns.high;
    var c = columns.low;

    for i in rows do // step through the 2D array to minimize cache misses
    {
        ref ai = a[i, ..]; // access the whole row (contiguous in memory)

        if ai[n] != zero then // treat the simplest case directly
        {
            c = n;
            break;
        }
        else // update the maximum column with at least one non-zero entry
        {
            c = lastNonZeroEntry(ai, c, n - 1);
        }
    }
    return c;
}

And yes, I probably should precompute n-1. But then the code is not as clear.

And yes, there are some smarts you can put in there to exit more rapidly if, at some stage during the search, the possible performance gain from trimming the matrix size is deemed not worth the effort, say if the gain is only 25%. But let's leave that for now.
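
For what it's worth, one way that early exit might look, as a sketch under assumptions: the name lastNonZeroColumnBounded and the gainThreshold default are mine, not from the original, and it reuses lastNonZeroEntry from above. Once fewer than gainThreshold of the columns could still be trimmed, the search gives up and reports the full width, so the caller does no trimming.

proc lastNonZeroColumnBounded(a : [?aD] ?R, gainThreshold : real = 0.25)
{
    param zero = 0:R;
    const (rows, columns) = aD.dims();
    const n = columns.high;
    var c = columns.low;

    for i in rows do
    {
        ref ai = a[i, ..];

        if ai[n] != zero then // rightmost column is non-zero: nothing to trim
            return n;

        c = lastNonZeroEntry(ai, c, n - 1);

        // give up once the trimmable fraction drops below the threshold
        if (n - c):real / columns.size:real < gainThreshold then
            return n;
    }
    return c;
}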

Hi Damian —

Clearing out my inbox before the holidays, I'm finding that nobody ever replied to this, for which I apologize. I don't have experience with this pattern, but my first intuition for a serial algorithm was much like yours: I was going to search the initial row backwards for the first non-zero to find the rightmost non-zero column; and then search subsequent rows from that column to the right.

var lastNonZeroCol = columns.first;
const zero = 0:a.eltType; // assuming a, rows, columns are in scope as in your code

// search the first row backwards for its rightmost non-zero
for c in columns by -1 {
  if a[rows.first, c] != zero {
    lastNonZeroCol = c;
    break;
  }
}

// then search each later row only to the right of the best so far
for r in rows.first+1..rows.last {
  for c in lastNonZeroCol+1..columns.last {
    if a[r, c] != zero {
      lastNonZeroCol = c;
    }
  }
}

But then, thinking about parallel implementations, you might want to do something like max reduce [r in rows] findLastNonZeroColumn(a, r); which would make this a full n**2 algorithm in the worst case, yet the parallelism might be worth it. Or it'd even be possible to write a custom reduction in which each task keeps track of the rightmost column it has found so far, only searching to the right of it (though writing a custom reduction is not very elegant or well-defined in the language yet, so this causes pain on the implementer's part in order to make applying it the one-liner you were wondering about).
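
To make that first idea concrete, here is a minimal sketch of the reduction; findLastNonZeroColumn is the hypothetical name from the previous paragraph and its body is an assumed implementation, with a and rows in scope as in the earlier snippet. Each row is scanned from the right until its first non-zero, and a max reduce over the parallel row loop combines the results.

proc findLastNonZeroColumn(a : [?aD] ?R, r) {
  param zero = 0:R;
  const (rows, columns) = aD.dims();
  for c in columns by -1 do // scan row r from the right
    if a[r, c] != zero then
      return c;
  return columns.first; // the whole row was zero
}

// the one-liner you were wondering about: rows are processed in parallel
const lastCol = max reduce [r in rows] findLastNonZeroColumn(a, r);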

I'm going to stop there for now. Sorry again that this question got away from us.

-Brad