Hi Tom —
Here's what I think is a functional but non-optimized implementation. Whether its performance is sufficient will depend on the number of locales, the buffer sizes, and how much time is spent in these routines relative to other parts of the code that may be more computationally intensive.
Due to the use of CTypes, this won't work in versions prior to 1.26.0 (though if you change that to CPtr, SysCTypes then I think you should be OK on older versions).
use AllLocalesBarriers, CTypes;

// A buffer located on locale 0 to help with the broadcast
var tmpBuffDom = {0..0:c_int};
var tmpBuff: [tmpBuffDom] real;

proc broadcastReal(buffer: c_ptr(real), count: c_ptr(c_int)) {
  const n = count.deref(),
        inds = 0..<n;

  if here.id == 0 {
    // grow the temp buff if it's not big enough
    if n > tmpBuffDom.size then
      tmpBuffDom = {inds};

    // copy locale 0's data into the buffer
    forall i in inds do
      tmpBuff[i] = buffer[i];
  }

  // wait until locale 0's got tmpBuff set up before proceeding
  allLocalesBarrier.barrier();

  // Locale 0 already has the data so doesn't need to do anything
  if here.id != 0 then
    forall i in inds do
      buffer[i] = tmpBuff[i];
}

// A buffer of atomics on locale 0 for computing the reduction
var atomicBuffDom = {0..0:c_int};
var atomicBuff: [atomicBuffDom] atomic real;

proc globalSumReal(sendBuf: c_ptr(real), recvBuf: c_ptr(real),
                   count: c_ptr(c_int)) {
  const n = count.deref(),
        inds = 0..<n;

  // grow the temp buff if it's not big enough
  if here.id == 0 then
    if n > atomicBuffDom.size then
      atomicBuffDom = {inds};

  // Make sure locale 0 has had the chance to resize before proceeding
  allLocalesBarrier.barrier();

  // have all locales atomically add their results to the atomicBuff
  forall i in inds do
    atomicBuff[i].add(sendBuf[i]);

  // Make sure all locales have accumulated their contributions
  allLocalesBarrier.barrier();

  // Have each locale copy the results out into its buffer
  forall i in inds do
    recvBuf[i] = atomicBuff[i].read();
}

And here's a test of the code:

// Test the routines
coforall loc in Locales {
  on loc {
    const locid = here.id;
    var data, data2 = [(locid+1)/10.0, (locid+1)*1.0, (locid+1)*10.0];
    var count = data.size: c_int;

    // Test the broadcast
    writeln("[", locid, "] Before bcast: ", data);
    broadcastReal(c_ptrTo(data), c_ptrTo(count));
    writeln("[", locid, "] After bcast: ", data);

    // Test the reduce
    globalSumReal(c_ptrTo(data2), c_ptrTo(data), c_ptrTo(count));
    writeln("[", locid, "] After reduce: ", data);
  }
}

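
One caveat with globalSumReal() as written: atomicBuff is never reset, so a second call would add to the totals left over from the first. The sketch below shows one possible way to make the routine safe to call repeatedly, by adding a third barrier and having locale 0 zero the accumulators afterward. The names globalSumRealRepeatable, sumBuffDom, and sumBuff are hypothetical, chosen only so the sketch is self-contained, and the cost of the extra barrier has not been measured here.

```chapel
use AllLocalesBarriers, CTypes;

// Hypothetical names: a separate accumulator on locale 0 used only by this sketch
var sumBuffDom = {0..0:c_int};
var sumBuff: [sumBuffDom] atomic real;

proc globalSumRealRepeatable(sendBuf: c_ptr(real), recvBuf: c_ptr(real),
                             count: c_ptr(c_int)) {
  const n = count.deref(),
        inds = 0..<n;

  // grow the accumulator if it's not big enough (new elements start at 0.0)
  if here.id == 0 then
    if n > sumBuffDom.size then
      sumBuffDom = {inds};
  allLocalesBarrier.barrier();

  // every locale atomically adds its contribution
  forall i in inds do
    sumBuff[i].add(sendBuf[i]);
  allLocalesBarrier.barrier();

  // every locale copies the totals out
  forall i in inds do
    recvBuf[i] = sumBuff[i].read();

  // Assumption: one more barrier so that no locale is still reading
  // when locale 0 resets the accumulators for the next call
  allLocalesBarrier.barrier();
  if here.id == 0 then
    forall i in inds do
      sumBuff[i].write(0.0);
}

// Call the routine twice per locale; the second call should not
// see anything left over from the first
coforall loc in Locales do on loc {
  var send = [1.0, 2.0, 3.0];
  var recv: [0..2] real;
  var count = send.size: c_int;
  for rep in 1..2 do
    globalSumRealRepeatable(c_ptrTo(send), c_ptrTo(recv), c_ptrTo(count));
  writeln("[", here.id, "] After two reduces: ", recv);
}
```

When run on several locales (for example with -nl 4), each locale should report numLocales times the send values after both calls. Because locale 0 finishes the reset before it reaches the first barrier of the next call, no other locale can start adding early. broadcastReal() has a similar hazard on repeated calls: a trailing barrier there would keep locale 0 from overwriting tmpBuff while other locales are still copying from it.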
Let us know whether this works for you, whether you find any mistakes in it, and whether the performance seems reasonable or not.
-Brad