Implementing analogues of MPI_Allreduce and MPI_Bcast

I'm trying to use the PRIMME library (GitHub - primme/primme: PReconditioned Iterative MultiMethod Eigensolver for solving symmetric/Hermitian eigenvalue problems and singular value problems) in my code. The library was written with MPI in mind, but it has some "hooks" such that users can choose their own implementation of collective operations. Specifically, I have to implement the following two functions:

int globalSumReal(double const *sendBuf, double *recvBuf, int *count);
int broadcastReal(double *buffer, int *count);

globalSumReal is very similar to MPI_Allreduce. It should perform a global sum (across locales) of each locale's sendBuf and store the result in every locale's recvBuf.

broadcastReal is very close to MPI_Bcast. It should broadcast an array from locale 0 to all others.
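
For reference, here's roughly what an MPI-based implementation of the two hooks would look like (just my sketch of the semantics I need to reproduce, using MPI_DOUBLE for concreteness; it's not PRIMME's actual code):

#include <mpi.h>

/* sketch: sum count doubles across all ranks; every rank gets the result */
int globalSumReal(double const *sendBuf, double *recvBuf, int *count) {
  return MPI_Allreduce(sendBuf, recvBuf, *count, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD);
}

/* sketch: send count doubles from rank 0 to all other ranks */
int broadcastReal(double *buffer, int *count) {
  return MPI_Bcast(buffer, *count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}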

Initially, I thought this would be trivial to do in Chapel, but now I realize that it might not be so easy... Do you have any suggestions on how to approach this problem, or any resources I can use for inspiration?

Thanks!
Tom

Hi Tom —

Just to clarify your question a bit:

  • Am I correct in assuming that to implement one of these operations, the PRIMME library will make one call into the corresponding routine per locale?
  • Do you know why the count arguments are pointers? (In the broadcast case, does locale 0 send in the size, and all other locales don't know the size, so it's returned to them? I can't guess why globalSumReal() would use a pointer, though...)

Thanks,
-Brad

Hello Brad,

  • Yes, that is exactly right. One can implement globalSumReal more or less like this:
MPI_Allreduce(sendBuf, recvBuf, *count, MPIU_REAL, MPI_SUM, MPI_COMM_WORLD)
  • Perhaps PRIMME is trying to ensure that it's easily callable from Fortran, and so makes all of the parameters pointers. I believe that count is meant to be used as an input parameter in both functions.

Thanks,
Tom

Can we be guaranteed that there will not be simultaneous calls to these routines before the previous batch has completed (e.g., multiple threads calling into them at the same time)? Can we also assume that each process will make the matching calls in the same order?

It seems virtually certain, or else they'd need some sort of ID to associate them.

-Brad

Yes, we can assume both of these things. The calls are supposed to be blocking, and I believe PRIMME is not using multithreading internally. Furthermore, it is possible to pass some per-locale global state to these functions if we need to.

Hi Tom —

Here's what I think is a functional, but non-optimized, implementation. Whether or not its performance is sufficient will depend on the number of locales, the buffer sizes, and the degree to which time is spent in these routines vs. other, more computationally intensive parts of the code.

Due to the use of CTypes, this won't work in versions prior to 1.26.0 (though if you change that to CPtr, SysCTypes then I think you should be OK on older versions).
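
That is, on a pre-1.26 compiler I believe the only change needed would be to the first line, something like:

use AllLocalesBarriers, CPtr, SysCTypes;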

use AllLocalesBarriers, CTypes;

// A buffer located on locale 0 to help with the broadcast                      
var tmpBuffDom = {0..0:c_int};
var tmpBuff: [tmpBuffDom] real;

proc broadcastReal(buffer: c_ptr(real), count: c_ptr(c_int)) {
  const n = count.deref(),
        inds = 0..<n;

  if here.id == 0 {
    // grow the temp buff if it's not big enough                                
    if n > tmpBuffDom.size then
      tmpBuffDom = {inds};

    // copy locale 0's data into the buffer                                     
    forall i in inds do
      tmpBuff[i] = buffer[i];
  }

  // wait until locale 0's got tmpBuff set up before proceeding                 
  allLocalesBarrier.barrier();

  // Locale 0 already has the data so doesn't need to do anything               
  if (here.id != 0) then
    forall i in inds do
      buffer[i] = tmpBuff[i];
}


// A buffer of atomics on locale 0 for computing the reduction                  
var atomicBuffDom = {0..0:c_int};
var atomicBuff: [atomicBuffDom] atomic real;

proc globalSumReal(sendBuf: c_ptr(real), recvBuf: c_ptr(real),
                   count: c_ptr(c_int)) {
  const n = count.deref(),
        inds = 0..<n;

  // grow the temp buff if it's not big enough, and clear out any sums
  // left over from a previous call
  if here.id == 0 {
    if n > atomicBuffDom.size then
      atomicBuffDom = {inds};
    forall i in inds do
      atomicBuff[i].write(0.0);
  }

  // Make sure locale 0 has had the chance to resize and reset before proceeding
  allLocalesBarrier.barrier();

  // have all locales atomically add their results to the atomicBuff            
  forall i in inds do
    atomicBuff[i].add(sendBuf[i]);

  // Make sure all locales have accumulated their contributions                 
  allLocalesBarrier.barrier();

  // Have each locale copy the results out into its buffer                      
  forall i in inds do
    recvBuf[i] = atomicBuff[i].read();
}

And here's a test of the code:

// Test the routines                                                            
coforall loc in Locales {
  on loc {
    const locid = here.id;
    var data, data2 = [(locid+1)/10.0, (locid+1)*1.0, (locid+1)*10.0];
    var count = data.size: c_int;

    // Test the broadcast                                                       
    writeln("[", locid, "] Before bcast: ", data);
    broadcastReal(c_ptrTo(data), c_ptrTo(count));
    writeln("[", locid, "] After bcast: ", data);

    // Test the reduce                                                          
    globalSumReal(c_ptrTo(data2), c_ptrTo(data), c_ptrTo(count));
    writeln("[", locid, "] After reduce: ", data);
  }
}
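
To try it out on multiple locales, I'd compile and run it roughly like this (assuming a multi-locale Chapel configuration; the filename collectives.chpl is just for illustration):

chpl collectives.chpl -o collectives
./collectives -nl 4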

Let us know whether this works for you, whether you find any mistakes in it, and whether the performance seems reasonable or not.

-Brad

Thanks a lot, Brad! Somehow this looks much easier than I expected...
I will test it out and let you know how it works.

Cheers,
Tom

I won't confess to the mistakes I made along the way then... :smiley:

-Brad