[Chapel Merge] Improve the speed of tracking arrays declared over a domain

Branch: refs/heads/master
Revision: 1e8866a
Author: ronawho
Log Message:

Merge pull request #16740 from ronawho/use-dll-dom-arrs

Improve the speed of tracking arrays declared over a domain

[suggested by @gbtitus, reviewed by @e-kayrakli and @mppf]

Switch from using a set to a doubly-linked list for tracking a domain's
arrays. In #15895 we switched the domain's list of arrays from a
singly-linked list to a hashtable because singly-linked lists have O(n)
removal time in general. For some common cases removal was O(1) in
practice because removal order was the reverse of insertion order, but
that's not true for arrays-of-arrays, where performance suffered badly.
That PR eliminated some worst-case overheads for tracking arrays, but it
hurt the simple in-order serial case and slightly increased compilation
time because the hashtable/set is pulled into every compile.

Here we switch to a doubly-linked list, where removal is always O(1)
because we hold a direct reference to the list "node" we want to remove.
This improves tracking speed because the O(1) constant factor is
smaller: there's no hashing or additional memory allocation anymore. It
also eliminates pulling in the hashtable, so compilation time should be
a little better. This was suggested by Greg during review of our
release notes. For some reason using a doubly-linked list hadn't
occurred to us previously, but in hindsight it's clearly a better and
simpler option.

In terms of implementation -- instead of using an external list data
structure like we used to, the list is implemented directly in the base
domain/array classes (an "intrusive" list). Even with an external data
structure, we'd have to expose a reference to the "node" and store it in
the base array to support O(1) removal (removal is only O(1) if you
already have a reference to the element being removed), so here we just
implement it directly, which also avoids an extra allocation per array.

Here's a micro-benchmark I'm using to evaluate this. It creates an
array-of-arrays, with toggles for comparing serial vs. parallel
execution and var vs. const domains. This PR improves performance for
var domains but won't impact const domains, since we don't track the
arrays for const domains. The const version is just there as a "best
case" baseline.

use Time;

config const doSerial = false;
config const constDoms = false;
config const outter = 5_000_000, inner = 3;

var   vDO = {1..outter}, vDI = {1..inner};
const cDO = {1..outter}, cDI = {1..inner};

var t: Timer; t.start();
serial doSerial {
  if constDoms then var A: [cDO][cDI] int;
               else var A: [vDO][vDI] int;
}
writeln(t.elapsed());
config     var before   var after   const
serial     3.19s        0.91s       0.87s
parallel   6.02s        3.45s       1.85s

So it's 3x faster when serial (which is ahead of the old, pre-#15895
performance) and 1.5x faster when parallel. Serial is pretty close to
the const performance, but parallel still has some overhead, likely
because of longer lock contention.

Resolves https://github.com/Cray/chapel-private/issues/1398
Closes https://github.com/Cray/chapel-private/issues/1460

Modified Files:
M modules/dists/DimensionalDist2D.chpl
M modules/internal/ChapelArray.chpl
M modules/internal/ChapelDistribution.chpl
M modules/internal/ChapelHashtable.chpl
M modules/internal/DefaultRectangular.chpl
M test/modules/sungeun/init/printModuleInitOrder.good
M test/modules/sungeun/init/printModuleInitOrder.na-none.good
M test/optimizations/constDomain/domainField2.chpl
M test/optimizations/constDomain/literalIntoVar.chpl
M test/optimizations/constDomain/trackArraysFlag.chpl
M test/optimizations/deadCodeElimination/elliot/countDeadModules.comm-ofi.good
M test/optimizations/deadCodeElimination/elliot/countDeadModules.comm-ugni.good
M test/optimizations/deadCodeElimination/elliot/countDeadModules.good

Compare: Comparing 03cf9d9aad0c...1e8866a11a6a · chapel-lang/chapel · GitHub