External Issue: .contains() on dense/block domain slower than manual bounds check

17884, "thomasrolinger", ".contains() on dense/block domain slower than manual bounds check", "2021-06-08T12:40:54Z"

Summary of Problem

The .contains() procedure is slower than an equivalent manual bounds check on a dense/block domain. By manual bounds check, I mean just checking whether the index in question is between the start and end index of the block domain. It was recommended that this be brought up, as it may not be the expected behavior.

Steps to Reproduce

Here is the full code: contains_ex.chpl.zip

Below is an excerpt from the code to highlight what's going on. The use-case is determining whether a given index would be a remote access into a block distributed array (part of the work presented at CHIUW 2021 https://chapel-lang.org/CHIUW/2021/RolingerSlides.pdf).

In the real application, the checks would be issued from different locales, but that isn't necessary to reproduce the performance behavior; the checks never require remote communication. All that matters is we do the manual checking, or use .contains(). So the attached code will execute only from locale 0. We require multiple locales to test since we want to have a realistic distributed array to "inspect".

The motivation for using .contains() is for cleaner code as well as the allowing the optimization to support distributions other than BlockDist, as long as the distribution supports the localSubdomain query.

Source Code:

// arr is a 1-d dense block distributed array

// bounds is a block-distributed array (1 element per locale) where each element is a (int,int)
// that corresponds to a locale's start/end index of its local partition of arr.

// indices is a 1-d array of random indices into arr; it is stored explicitly 
// on locale 0 (i.e., not distributed).

// Manual bounds checking method
const start = bounds[here.id](0);
const end = bounds[here.id](1);
for idx in indices {
   if idx < start || idx > end {
      // idx would be a remote access
   }
}

// Using .contains instead.
// locale_doms is a block-distributed array (1 element per locale) where each element is `domain(1)` 
// that corresponds to a locale's local subdomain of arr.
const ref locale_dom = locale_doms[here.id];
for idx in indices {
   if !locale_dom.contains(idx) {
      // idx would be a remote access
   }
}

Compile command:
chpl --fast contains_ex.chpl

Execution command:
./contains_ex -nl <num_locales> --n=<number of elements/accesses to make>

Expected output:
The program will output two lines, each containing elapsed time for the bounds checking approach or the .contains() approach. It also mentions how many remote accesses were detected (again, no remote accesses are actually issued); this just serves as a validation to show that both approaches are doing the same thing.

On my system (20 cores per node) I get the following output on 4 locales with --n=10000000

### Bounds checking: 0.035340 seconds (number of remote accesses=7499391)
### Contains:        0.073172 seconds (number of remote accesses=7499391)

Configuration Information

  • Output of chpl --version: chpl version 1.24.1 built with LLVM version 11.0.1
  • Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: gnu
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: ibv *
  CHPL_GASNET_SEGMENT: large
CHPL_TASKS: qthreads *
CHPL_LAUNCHER: gasnetrun_ibv *
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc *
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled *
CHPL_HWLOC: bundled *
CHPL_REGEXP: re2
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none
  • Back-end compiler and version, e.g. gcc --version or clang --version: gcc (GCC) 8.2.0