Matrix Transpose

For the limited case where you have an array

v[1..m, 1..n]

which you want as a copy of the transpose of

u[1..n, 1..m]

where those ranges can be anything as long as they match, you could have

proc cotranspose(u : [?uD] ?R, v : [?vD] R)
{   
    const (rows, columns) = uD.dims();
    const r = rows.size, c = columns.size;
    param B = 16;

    if r <= B && c <= B then
    {
        for (i, j) in uD do
        {
            v[j, i] = u[i, j];
        }
    }
    else if r < c then
    {
        const (cb, ce) = (columns.low, columns.high);
        const s = cb + c / 2;
        const lower = cb .. s;
        const upper = s + 1 .. ce;

        cotranspose(u[rows, lower], v[lower, rows]);
        cotranspose(u[rows, upper], v[upper, rows]);
    }
    else
    {
        const (rb, re) = (rows.low, rows.high);
        const s = rb + r / 2;
        const lower = rb .. s;
        const upper = s + 1 .. re;

        cotranspose(u[lower, columns], v[columns, lower]);
        cotranspose(u[upper, columns], v[columns, upper]);
    }
}

It also suffers from some of the same problems mentioned in #20803, which we will address soon.

Do we really need to handle strided arrays? In my use in linear algebra, I have never needed to in decades. But maybe some of the newer uses in ML and the like might need it. Any other comments before I spend some time on it?

In this era of vectorization, the forall over both indices might be better done as a for down each row, with each row then processed by a foreach. Needs some testing, preferably across several different architectures such as X86-64 and ARM (like on the current Mac).
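A minimal sketch of that loop shape (the name rowwiseTranspose is my own; this assumes non-strided rectangular arrays with matching transposed dimensions):

```chapel
// Sketch of the for-over-rows / foreach-within-row shape described above.
proc rowwiseTranspose(ref v : [] ?R, const ref u : [] R)
{
    const (rows, columns) = u.domain.dims();

    for i in rows do            // walk u one row at a time
        foreach j in columns do // inner loop is a vectorization candidate
            v[j, i] = u[i, j];  // note: the writes into v are strided
}
```

Whether the compiler can actually vectorize the inner foreach given the strided writes into v is exactly the sort of thing that needs measuring per architecture.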

Damian —

I may be missing something but, to me, the use of recursion here feels like unnecessary overhead and complexity. What's the motivation?

I'm also OK with not supporting strided index sets in these cases initially, as long as there are checks to make sure someone doesn't try to use them.

I'd also add some sort of assertions that vD's dims match uD's (transposed) to avoid OOB errors.

-Brad

Good point. There must have been, or I intended there to be, a forall or cobegin or something.

I suggest renaming the formals to ref dest : .... and const ref src : ...., making dest the first formal as we do in other cases.

Try this

// cache-oblivious transpose

proc cotranspose(dst : [] ?R, src : [] R, blocksize : int)
{
    proc split(g : range, n : int)
    {
        const lo = g.low, cut = lo + n / 2;

        return (lo .. cut, cut + 1 .. g.high);
    }
    proc trn(rows : range, columns : range, v : [] R, u : [] R)
    {
        const r = rows.size, c = columns.size;

        if r <= blocksize && c <= blocksize then
        {
            for (i, j) in {rows, columns} do // could be a 'forall'
            {
                v[j, i] = u[i, j];
            }
        }
        else if r < c then
        {
            const (lower, upper) = split(columns, c);

            trn(rows, lower, v, u);
            trn(rows, upper, v, u);
        }
        else
        {
            const (lower, upper) = split(rows, r);

            trn(lower, columns, v, u);
            trn(upper, columns, v, u);
        }
    }
    const (rows, columns) = src.domain.dims();
    const (_rows, _columns) = dst.domain.dims();

    assert(rows.stride == 1 && columns.stride == 1);

    assert(rows.size == _columns.size && _rows.size == columns.size);
    
    trn(rows, columns, dst.reindex(columns, rows), src);
}

I normally use such a routine in a serial scenario where I guarantee that the stride is always 1 and that the dimensions of the transpose are correct.

I would probably use a blocksize of 8 or 16, but there may be better choices. Try 32 for yourself.

The for statement could just as easily be a forall. Whether that makes sense in a scenario where, at best, the copy is between two hypermatrices that are never bigger than 16x16 is another thing.

Is recursion overkill? Maybe. What overhead does recursion bring within Chapel?

You could also consider the following for the parallel case where those asserts have disappeared in the name of brevity.

// parallel transpose

proc patranspose(ref dst : [?D] ?R, const ref src : [?S] R, blocksize : int)
{
    proc hyperRange(k) // range for hypermatrices
    {
        return 0 .. (k + blocksize - 1) / blocksize - 1;
    }
    proc blockRange(lo, i, hi) // range within a hypermatrix
    {
        const _lo = lo + i * blocksize;

        return _lo .. min(_lo + blocksize - 1, hi);
    }
    proc trn(rows : range, columns : range, v : [] R, u : [] R)
    {
        const (m, rlo, rhi) = (rows.size, rows.low, rows.high);
        const (n, clo, chi) = (columns.size, columns.low, columns.high);

        forall (r, c) in { hyperRange(m), hyperRange(n) }
        {
            for (i, j) in { blockRange(rlo, r, rhi), blockRange(clo, c, chi) }
            {
                v[j, i] = u[i, j];
            }
        }
    }
    const (rows, columns) = S.dims();

    trn(rows, columns, dst.reindex(columns, rows), src);
}
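To make the decomposition concrete, here is a standalone worked example of the two helpers, with blockRange capped by the dimension's high bound explicitly so it also behaves for ranges that do not start at 1. It assumes a blocksize of 16 and a 40-element dimension indexed 1..40:

```chapel
param blocksize = 16;

proc hyperRange(k : int) // which hypermatrix blocks cover a k-element dimension
{
    return 0 .. (k + blocksize - 1) / blocksize - 1;
}
proc blockRange(lo : int, i : int, hi : int) // indices of the i-th block
{
    const _lo = lo + i * blocksize;

    return _lo .. min(_lo + blocksize - 1, hi);
}

writeln(hyperRange(40));           // 0..2: three hypermatrix blocks
for i in hyperRange(40) do
    writeln(blockRange(1, i, 40)); // 1..16, then 17..32, then 33..40
```

The last block is the partial one, which is why the min against the high bound is needed.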

That innermost for from patranspose might vectorize better; try it yourself.

I grabbed cotranspose from a routine I wrote to compute the Jacobi Singular Value Decomposition (SVD). The transposition was originally written to show how easy recursion was in Chapel. Then I was too lazy to rewrite it recursion-free because 99.9% of the computational effort in the routine went into computing the SVD. Also, because I was in control of allocating the transpose matrix which was the destination, I did not need those assertions.

As an aside, when I write routines for linear algebra, I never check for unit stride. I assume that anybody doing linear algebra is smart enough to always use unit stride. Most of us have spent our careers trying to craft algorithms that only use unit stride. And today's hardware punishes you big time if you try to use anything else anyway.

With blocksizes of 16 and 32, and SMP, as written, the recursive version is slightly faster for matrix sizes less than 100x100.

Wrapping everything in a serial block and replacing the forall in patranspose with a for: for matrices less than 1000x1000 with a blocksize of 32, the recursive version is still faster; with a blocksize of 16, the recursive version is faster for matrices under 100x100 and still competitive up to 1000x1000.

I saw a big flaw in my own code. For small matrix sizes, e.g. less than 32x32 (maybe), it makes no sense to do anything smart, not even simplistic parallelism. So, for that case, a simple for over the domain is still the fastest by nearly an order of magnitude. Even the routine in LinearAlgebra suffers from the same problem.
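A sketch of that size-based dispatch, assuming the cotranspose(dst, src, blocksize) from earlier in the thread; the name smartTranspose and the cutoff of 32 are placeholders that would need tuning per architecture:

```chapel
// Sketch: fall back to a plain loop below a (hypothetical) size cutoff,
// since anything fancier only pays off above it.
proc smartTranspose(ref dst : [] ?R, const ref src : [] R)
{
    param cutoff = 32; // placeholder; needs per-architecture testing

    const (rows, columns) = src.domain.dims();

    if rows.size <= cutoff && columns.size <= cutoff then
        // small case: a simple for over the domain is fastest
        for (i, j) in src.domain do
            dst[j, i] = src[i, j];
    else
        cotranspose(dst, src, 16); // blocked version for larger matrices
}
```

The right cutoff and the right inner blocksize are exactly the two knobs the proposed GitHub issue would need to pin down by testing.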

Based on the useful feedback, I will lump all of this into a GitHub issue. It will need lots of testing across several architectures to get a good idea of a robust and generic-enough size below which a fancy method is counterproductive, and of the blocksize to use in the core routine of each approach when doing something fancy is worthwhile.

Thanks - Damian