Hello! I would like to know more information about how can we implement data distribution effectively for sparse tensors in chapel over multiple locales?

Currently I am only aware of the method where I use some sort of distribution for the backing dense domain ( E.g. Block Distribution ) , and then create a sub-domain over that dense domain. But has the issue that there might be many blocks in certain locales that have no/few nnz entries. So is there a way where I could, potentially distribute only the nnz entries in the sparse tensor for better packed efficiency?

As an aside, any other information as to how sparse arrays/domains are implemented and distributed in general would also be useful. I could also provide more specific information if required.

The way you describe is the way to create sparse domains and arrays in Chapel.
Distributing only nnz entries is not supported by any of the domain maps we have
today, but IIRC, there has been some discussions about supporting it.

A somewhat obvious and probably undesirable path you can take is wrapping two
block distributed arrays in a class/record and storing indices in one of them
and array data in the other. Therefore, you wonâ€™t be using any Chapel sparse
domains/arrays, but would be implementing your own. (And thus, undesirable) I am
not sure how youâ€™d distribute CSR domain, though.

I imagine more efficient distributed sparse domains can be implemented as user
defined domain maps, which is an advanced feature of the language:

The sparse primer that you may have already come across:

See â€śSparse Domainsâ€ť here:

Feel free to elaborate more on your use case in case there is something better
you can do with Chapel as it stands today. Also, feel free to open a GitHub
issue if that feature is important to you.

Tagging onto Enginâ€™s response, Iâ€™d say that Chapel was designed to support arbitrary distributions and implementations of sparse domains and arrays, but that we donâ€™t support very many cases today (where itâ€™s a standard chicken and egg problemâ€¦ not many people have used those features, so we havenâ€™t spent much time on them; and because we havenâ€™t spent much time on them, they arenâ€™t as built out as other features, so people donâ€™t use them).

Turning the question around on you a bit: If you have a specific vision in mind, could you describe more about how you would like to (a) represent the sparse indices and elements and (b) characterize that representation as a client? E.g., do you want to linearize the nnz elements in some way and then chunk those up equally as Engin described? Or do you want to use a variant on block-distributing the bounding box of the sparse indices that simply doesnâ€™t use evenly-sized blocks? Or something else?

Iâ€™ll also note that while writing a new domain map is a fairly significant endeavor, since itâ€™s Chapel code, anything that is done to prototype the explicit representation of the indices is a useful starting point, such as Enginâ€™s suggestion to create an object that behaves domain-like or array-like. And Iâ€™ll mention that Iâ€™ve been meaning to write a friendlier â€śSo you want to write a domain map?â€ť guide that takes the user through the process step by step, so if that would be useful, thatâ€™s something I could bump up in priority in November.

The current usecase would be implementation of MTTKRP for sparse tensor decomposition. Hence, we were initially looking at an implementation of the bounding box for the sparse tensor indices that would equally distribute the nnz elements in the tensor. But as of now we have reverted back to the equally sized distribution of the tensor blocks, so I believe you could keep the guide as a low priority for now, though it may be useful in the future. Though to answer your question, as a user, I would have liked to have a variant on block distribution of the sparse matrix/tensor that would use blocks that would balance out the nnz elements equally in each block.

As a follow-up question, what would be the best way to sample nnz elements per processor, assuming that the sparse tensor/matrix is block distributed? For the zero based elements, I could potentially just calculate the smaller bounding box for that processor and sample random indices, then check if it is a valid zero element. But for the nnz elements, the only way I can think of would be to iterate over all the nnz elements and then manually store then in lists , after which I could sample, but that would be inefficient. So if thatâ€™s the only way or if we could do something better with the current functionalities?

Lastly, As an aside to the Distributions, I had another ( hopefully simpler )question. Say I wanted to distribute a matrix( the factor matrices in this case ) into chunks of rows across processors, how would I do that with the current implementation? Given a matrix of size MxR and P processors. I would like to distribute (M/P) rows on each of the processors. I have tried block distribution, but that distributes across the rows as well, but since I want the entire row to be present in one processor this wouldnâ€™t be valid. Alternatively I have tried Block Cyclic as well, so while that does assign an entire row to a processor , I wasnâ€™t able to get a valid distribution that uses all the processors. So any leads regarding this would be useful as well.

Thanks for the additional information. In a distribution that evenly blocked the nnz elements, would you imagine them to be sorted/ordered in some way before being blocked, or not? Or would you imagine the distribution would be unaware of that and would leave it to the user to supply the nnz elements in whatever order they wanted? As Engin mentioned before, I think the best way to start in this direction would be to use Chapelâ€™s 1D dense block arrays to represent the nnz elements and sparse structure explicitly and to wrap them in an object that would act sparse matrix-like. If this prototype were successful, it would be a good first step toward writing a new domain map that behaved similarly.

The way to distribute a 2D array in one dimension in Chapel is to use the targetLocales argument that most distributions take to pass in a numLocales x 1 or 1 x numLocales array of locales as the target. The distribution will respect this arrangement of the locales in distributing the 2D set. When the argument is not supplied, the distributions use a heuristic to create a square-ish view of the locales by default.

Thank you for the clarity regarding the matrix distribution, that makes sense. I had gone through the various types and I was aware of the heuristic squishing and hence, was trying different permutations of the 1D locales, or reshaping the Locales in the same dimensional space as the matrix.

Also, I had updated my previous post with a follow-up question which I will add here for continuity.
â€śWhat would be the best way to sample nnz elements per processor, assuming that the sparse tensor/matrix is block distributed? For the zero based elements, I could potentially just calculate the smaller bounding box for that processor and sample random indices, then check if it is a valid zero element. But for the nnz elements, the only way I can think of would be to iterate over all the nnz elements and then manually store then in lists , after which I could sample, but that would be inefficient. So if thatâ€™s the only way or if we could do something better with the current functionalities?â€ť

Lastly, for the evenly blocked nnz elements use case, as a user , I would prefer having to first explicitly declaring the sparse domain to distrubute the nnz elements, and then the user could add the nnz elements in a way they would like(as is the case with the current implementation). But that could potentially be complicated from a development point of view as you may have to handle cases of re-balancing the nnz across the processors. So the other alternative as you mentioned may be more feasible, but in that case the distribution would have to happen after the indices are allocated, which doesnâ€™t seem to be possible for any of the current distributions.

Thanks for repeating the question, which I had missed. Iâ€™m not sure Iâ€™m understanding it, though. Are you asking how, given the current Block-distributed sparse arrays, you could query just the indices (or values?) or the nnz elements that are mapped to a given processor? If not, maybe you could restate the question. Iâ€™m specifically stumbling over what you mean by â€śsampleâ€ť and whether you consider â€śnnzâ€ť to be the total number of non-zeros in the array, or the number mapped to a single locale.

I think your last paragraph touches on the inherent challenges of writing a distribution that is sensitive to the specific location of non-zeroes in a sparse array: where do you store them prior to creating the distribution (in a way thatâ€™s scalable)? or, if you canâ€™t afford to do that, how do you reason about their values well enough to do a good job with the distribution? This is why I turned the question back around on you on just what you wanted to do in my original response. I can imagine a lot of ways people might want to store and populate a sparse array, but wasnâ€™t clear what you want to do (and am still not particularly certain).

Since code is more precise than English, could you sketch out some pseudo-code or pseudo-Chapel illustrating what youâ€™d ideally like to do? I think that might make the use case youâ€™re striving for clearer for me (where itâ€™s still not particularly). Part of the problem here may be that Iâ€™m not particularly familiar with MTTKRP.

The problem is as such, given a given a sparse tensor calculate itâ€™s tensor decomposition. The core algorithm for which is the MTTKRP operation.

The core equation of a MTTKRP operation can be roughly defined as,
A_{ij} = X_{ikl} â‹… D_{lj} â‹… C_{kj}
where X is the original 3 dimensional tensor and where A,D,C are factor matrices.

For simplicity you can think of it as the Tensor analogue for SVD. The following link provides the equation above. ( But is not the best reference, nor has anything to do with TACO , so please only refer the equation from the link below )

To do this efficiently, it is important how the non-zero elements are distributed across the processors,so that the above operation and similar operations can be performed efficiently. Hence, in the problem stated, the sparse tensor would be possibly initially stored on a single processor(locale), and then distributed if possible( Similar to a Scatter operation ). As of now, we are still deciding on that, but itâ€™s assumed that we have a sparse tensor. So whether to create an sparse tensor with predefined distribution or distribute it later is still being decided, and are open to suggestions as to which would be easier( or natural ) to do in the language.

Regarding the sampling I mentioned previously let me rephrase. I am asking that given the Block Distributed Sparse Tensor, how would I be able to say given an array list of indices of nnz values, sample( or generate ) a sub-tensor with those indices only. The following pseudo code may be able to explain this better.

X = sparse_distributed_tensor() //Original tensor
X_sub = sparse_distributed_tensor() //Currently Empty
for k in locales:
X_k = X[k] //This is just to represent the tensor block in locale k
indices = X_k.nnz.list()
rand_ind = rand(indices) / random list of indices of nnz in locale k
X_sub_k = X_k[rand_ind]
X_sub[k] = X_sub_k
process(X_sub)

Hope this clarifies the problem statement and the current task at hand. Please let me know if something is unclear. Note: Excuse my liberal usage of different styles of notations, but my intention was to be provide more clarity.

Thanks for the additional descriptions, and sorry for the delayed response. We've had a number of big deadlines recently, and I'm still digging out.

OK, thanks. If you can afford to store the sparse tensor on a single locale to begin with, then I think you could do fairly arbitrary things (manually today, or ultimately using a user-defined sparse distribution) to inspect it and determine how to best create a distributed version of it. This seems logically straightforward to me, but of course could create a scalability issue by not supporting the ability to run on a sparse tensor that was too large for a single node's memory.

If you can't afford to store the sparse tensor on a single locale, that's where I was looking for more information before suggesting anything: I.e., how would you want to decide how to distribute the index space without being able to store it all locally / and how would you want to store the nnz's while you were figuring that out? (I can imagine a number of ways of doing this, so was curious which you were interested in pursuing). But we don't need to go down that path for now if you can assume it will fit in a single node's memory (or know it always will).

Thanks for the pseudo-code for the sampling you were asking about which was very helpful to me. I think I could just about transliterate this into Chapel for you except for the following line whose meaning I didn't quite understand:

Namely, what does rand() do / what is the type of rand_ind?

Thanks, and Iâ€™ll try to get the next response in this conversation to you more quickly. We could also jump onto a videoconference at some point if that would be helpful / save time.

Yes, at this point I would like to start off with the assumption that the sparse tensor will be able to fit on a single node. Regarding you next question, by â€śrand()â€ť it was just meant to randomly sample random indices from a the list of indices. So to simplify all I would require would be,

X_sub_k = random_nnz_elements(X_k)

For ex: If X_k = [ (1,1,4) , (1,2,3) , (2,3,1) , ..... , (3,4,3) ] then X_sub_k = [ (1,2,3) , (2,3,1) , ... ]

And sure It would be simpler and faster to talk over videoconference. How should I contact you regarding the same?

OK, so I failed at that... I did spend some time today sketching out some code in Chapel to capture your pseudocode above, but haven't completed it yet. One thing I still don't understand about your rand() routine is how it's decided how many random elements to return.

[For those following along: Offline, @miteshsk clarified that the rand() routine would also take an argument indicating the number of samples to generate]

Iâ€™ve also been continuing to translate Miteshâ€™s pseudocode to Chapel and will post the result here once I have something worthwhile.

In doing so, Iâ€™m running pretty frequently into limitations in Chapelâ€™s support for block-distributed sparse arrays that Iâ€™m capturing as issues. Itâ€™s a little disappointing to keep writing things that donâ€™t work as intended, but none of these are particularly surprising given how little attention distributed sparse arrays have received in Chapel.

The following is not yet a complete transcription of Miteshâ€™s computation, but it shows some basic manipulations of local and distributed sparse domains, and is my stopping point for the weekend. Iâ€™ve used workarounds for the issues above:

use BlockDist;
config const n = 10;
/* Declaring dense and spares domains x non-distributed and distributed */
// A dense and sparse domain stored on locale 0
const Dom = {1..n, 1..n};
var SparseDom: sparse subdomain(Dom) = [i in 1..n] (i,i);
writeln(SparseDom);
// a distributed sparse domain, initially empty
const DistDom = Dom dmapped Block(Dom);
var DistSparseDom: sparse subdomain(DistDom);
/* Assigning a local sparse domain to a distributed one: */
// Should be able to do:
//
// DistSparseDom = SparseDom;
//
// but it isn't supported yet (issue #16770), so using a workaround:
assignDefaultSparseToDistSparse(DistSparseDom, SparseDom);
/* Printing a distributed sparse domain: */
// Should be able to do:
//
// writeln(DistSparseDom);
//
// but it isn't supported yet, so use a workaround:
writeDistSparseDomUnordered(DistSparseDom);
/* Clear out the distributed sparse domain, arbitrarily */
DistSparseDom.clear();
coforall loc in Locales with (ref DistSparseDom) {
on loc {
// Let's see what our local bounding box for sparse indices is
const myLocDenseInds = DistDom.localSubdomain();
writeln("locale ", loc.id, " owns indices ", myLocDenseInds);
//
// localize the non-distributed sparse domain's array of nonzeroes
// to avoid a communication penalty for continually accessing it
// over the network
//
var localizedSparseInds = SparseDom.arrayOfInds();
writeln("[", here.id, "] Here's my local copy of all the indices: ",
localizedSparseInds);
//
// filter down to the indices that this locale specifically owns
//
var myLocSparseInds = [idx in localizedSparseInds]
if myLocDenseInds.contains(idx) then idx;
writeln("[", here.id, "] Here are the ones that belong to me: ",
myLocSparseInds);
// I could then randomly downsample that array here...
// Then I can add them back into the distributed domain:
DistSparseDom += myLocSparseInds;
// Let's make sure what I own is correct:
writeln("Locale ", here.id, " owns: ", DistSparseDom.localSubdomain());
}
}
writeDistSparseDomUnordered(DistSparseDom);
proc _domain.arrayOfInds() {
return this._value._indices[1..#SparseDom.getNNZ()];
}
proc assignDefaultSparseToDistSparse(ref Dist, Def) {
Dist += Def._value._indices[1..#SparseDom.getNNZ()];
}
proc writeDistSparseDomUnordered(D) {
forall ij in D do
writeln("locale ", here.id, " owns index ", ij);
}

Thank you for the follow-up and the code snippet. I would like to add that if the DistributedSparse Domain is already pre-filled, then it can just be accessed using DistSparseDom.localSubdomain() directly which would give the non-zero indices for that locale.

In any case, as of now, my query is clarified regarding sampling, so thank you for your responses.