New Issue: [Patterns] distributed histogramming

19102, "lydia-duncan", "[Patterns] distributed histogramming", "2022-01-25T19:55:28Z"

This issue is part of a series of issues to design the interface for collections across serial, parallel, and distributed contexts. Our goal is to determine what we think is reasonable to support and what doesn't seem useful. Additional patterns welcome (but it would be best to put them in their own issue for discussion, likely). In particular, this pattern was suggested by Michael.

// suppose hist is a distributed map, input is a distributed array
forall key in input {
  hist[key] += 1;
}

This could cause a race depending on the underlying implementation. Adding a new key in another thread could cause the underlying data structure to resize, and that could make the reference to hist[key] in this thread become invalidated.

If we don't return a reference from hist[key], it becomes a lot harder to implement a histogram using this map. If we don't allow resizing to impact already existing keys, we still could run into reference invalidation if a key is removed in a parallel setting (solutions 1 and 2 below would avoid that case).

Michael had a few ideas for what to do if we don't return a reference from hist[key]:

  1. We could pass in a lambda/function object to perform the update, something like Count.update(key, lambda(x) { x += 1; } )
  2. We could use a context manager to wrap the update: with Count.update(key) as val { val += 1; }
  3. We could add a lock per key and make the user deal with atomicity
  4. We could add a feature like the one proposed in #12306