Vectorized and parallel loops. Can I have both?

This is more a question for clarification than an issue. I just wonder how Chapel loops work.

  • forall tells Chapel to run the code on multiple tasks.
  • foreach tells Chapel that a loop is unordered.

Can I have both at the same time?

As an example, consider a trivial sum aggregation. I can imagine a code block such as

proc sum(values) {
  var count : atomic uint(32) = 0;
  forall i in values.domain do
    count.unorderedAdd(values[i]);
  return count.read();
}

In general this example is a good candidate for speedups (with or without atomic operations). I can imagine that the compiler can trivially optimise loops like this with SIMD instructions.
At the same time there are clear speed benefits in using all the tasks you have at your disposal on HPC systems.
I would really appreciate a rule of thumb for when to choose foreach and when to choose forall.

Hi Andreas —

I'd say that the rule of thumb for using forall vs. foreach is whether you want to use multiple tasks (forall) or not (foreach). Apart from that, they are intended to be fairly interchangeable, though there are some loose threads with foreach due to it still being relatively new (most notably: its support for with clauses is not yet complete, there are some nagging questions about whether it should insert shadow variables more similarly to forall, and we don't support expression-level foreach loops yet).

Note that there's nothing about a forall loop that automatically or even necessarily creates tasks. Instead, forall essentially says "Invoke the parallel iterator(s) for this iterand." As a result, it's the definition of the parallel iterator being invoked that says how many tasks to create, where to run them, how to divide the work between them, etc., and some forall loops may heuristically choose not to create new tasks at all. For our built-in types (ranges, domains, arrays), the parallel iterators are typically written such that the innermost loop that each task runs is a foreach, so you should effectively get the benefits of both multi-tasking and vectorization (say) when applicable.
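
To make that concrete, here is a minimal sketch of what such an iterator can look like. It is not the actual implementation of Chapel's built-in iterators; the name myIter and the simple blocked decomposition are just for illustration.

// serial overload, used by for loops and as a fallback
iter myIter(n: int) {
  foreach i in 0..#n do
    yield i;
}

// standalone parallel overload, invoked by forall loops
iter myIter(param tag: iterKind, n: int) where tag == iterKind.standalone {
  const numTasks = here.maxTaskPar;
  coforall tid in 0..#numTasks {       // this is where the tasks get created
    // a simple blocked decomposition of the iteration space
    const lo = tid * n / numTasks,
          hi = (tid + 1) * n / numTasks - 1;
    foreach i in lo..hi do             // order-independent, vectorizable inner loop
      yield i;
  }
}

// usage: multiple tasks, each running a foreach over its block
var A: [0..#1000000] int;
forall i in myIter(1000000) do
  A[i] = i;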

Let us know if this doesn't answer your question or raises additional ones,
-Brad

PS — Note that you can format Chapel code here on discourse using ```chapel ... ``` environments, like:

proc sum(values) {
  var count : atomic uint(32) = 0;
  forall i in values.domain do
    count.unorderedAdd(values[i]);
  return count.read();
}

It's not terribly fancy (yet), but better than nothing.

I think that cleared it up.

It is a good starting point for learning more about iterators, and more specifically about when forall will spawn new tasks. I can figure this part out.

The thing I have a hard time with is the --vectorize flag in the Chapel compiler, and how it relates to foreach loops. I don't think the challenge here is the theory but rather how I can peek inside the black box. I want to see the effects of the code I am writing. I am happy if I can see that my loops are executed with any kind of SIMD instruction.

So to get to know this I ran the simplest experiment: I tried to write the same program in both Chapel and C, the simplest program I could imagine.

First I wrote a program in C and verified that clang on my Mac does a decent job unrolling and vectorising it.

My program looks like this:

#include <stdio.h>

void sum() {
    int A[256];
    int B[256];
    for ( int i = 0; i < 256; i++ )
    {
        A[i] = B[i] + 1;
    }

    printf("%d", A[0]);
}

int main() {
    sum();
    return 0;
}

I compiled with clang -O3 -march=native -Rpass=loop-vectorize -S Poc.c and got an assembly file along with this compiler output.

Poc.c:6:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
    for ( int i = 0; i < 256; i++ )
    ^
Poc.c:6:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]

Clang tells me that this loop was vectorised. This is great feedback. I can imagine this loop being up to 8 times faster than a sequential loop.

Then I tried my best to do the same with Chapel.

/* Documentation for Poc */
module Poc {
  writeln("New library: Poc");

  proc sum() {
    var A : [0..255] int(32);
    var B : [0..255] int(32);

    for i in A.domain do
      A[i] = B[i] + 1;
  }

  proc main() {
    sum();
  }
}

I assumed I could pass the same input to clang under the hood of the Chapel compiler, so I compiled with
chpl src/Poc.chpl --fast --target-compiler=llvm --ccflags -Rpass=loop-vectorize --print-commands --explain-verbose. The important parts are that I explicitly state --target-compiler=llvm to use the correct compiler, and --ccflags -Rpass=loop-vectorize to pass the vectorization-remark flag along to the back end.

The output does not help me much. I can't see what is going on.

This loop is so simple that it should be trivial to run faster than sequential code; it almost asks for optimisation. For all I know it might already be optimal. Unfortunately I cannot see how the compiler treats this loop.
For the curious readers, what chpl wrote to stdout is listed below.

<internal clang code generation> -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/standard -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/packages -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/localeModels/flat -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/localeModels -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/comm/none -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/comm -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/tasks/qthreads -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/qio -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/atomics/cstdlib -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/mem/jemalloc -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/utf8-decoder -DCHPL_JEMALLOC_PREFIX=chpl_je_ -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/jemalloc/install/target/darwin-x86_64-native-llvm-none/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/include -I. -I/var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l -O3 -DCHPL_OPTIMIZE -march=native -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/internal -Rpass=loop-vectorize -DCHPL_GEN_CODE -pthread -I/usr/local/include -include sys_basic.h -include ctype.h -include wctype.h -include llvm/chapel_libc_wrapper.h

# Make Binary - Linking
/usr/local/Cellar/llvm/13.0.1_1/bin/clang++ /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/chpl__module.o /usr/local/Cellar/chapel/1.26.0/libexec/lib/darwin/llvm/x86_64/cpu-native/loc-flat/comm-none/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none/main.o -o /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/Poc.tmp -L/usr/local/Cellar/chapel/1.26.0/libexec/lib/darwin/llvm/x86_64/cpu-native/loc-flat/comm-none/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none -lchpl -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -lhwloc -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/lib -lqthread -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -lhwloc -lchpl -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/jemalloc/install/target/darwin-x86_64-native-llvm-none/lib -ljemalloc -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/lib -lre2 -lgmp -lm -lpthread -L/usr/local/lib
rm -f Poc
mv /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/Poc.tmp Poc


Hi Andreas —

I'm not an expert here (so will rely on others like @mppf to correct me), but I believe that if --no-vectorize is used at compile-time, it will squash passing along any information about the vectorizability of the loop to the back-end LLVM compiler. To that end, it might be viewed as treating any foreach loops as though they were simply for loops, such that they'd only be vectorized if the back-end compiler could prove to itself that it was reasonable to do so. In contrast, when --vectorize is used (the default), we pass additional information to LLVM for foreach loops to indicate their potential for vectorization. As the man page for --vectorize suggests, there are no guarantees that a given loop will be vectorized since it depends on whether or not the back-end compiler can effectively do so, where different compilers have different capabilities depending on the version and target processor.
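
For example, a sketch of comparing the two modes on the Poc program from earlier in the thread:

chpl src/Poc.chpl --fast                  # --vectorize is on by default: foreach
                                          # loops are hinted to LLVM as candidates
                                          # for vectorization
chpl src/Poc.chpl --fast --no-vectorize   # no hints are passed; foreach is treated
                                          # like a plain for loop by the back end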

So to get to know this I ran the simplest experiment: I tried to write the same program in both Chapel and C, the simplest program I could imagine.

In your C vs. Chapel comparison, note that the code generated for the Chapel arrays will be more complex than the C arrays in that Chapel's arrays are heap-allocated and support richer indexing capabilities, such as indices that may not be 0-based or may be multidimensional. While one might imagine that we could optimize small, 0-based, 1D arrays to have the same implementation as C, we don't do this today.

Similarly, looping over A.domain will result in additional indirection / lookup of loop bounds relative to looping over 0..255 in Chapel or C.

Due to this additional complexity, a foreach loop may be necessary to give the back-end compiler enough information to generate code competitive with C (though it may turn out not to be needed).
In either case, it would certainly be the recommended approach for writing a simple, intended-to-be-vectorizable computation like this one in Chapel.
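
For example, a minimal sketch of the sum proc from earlier rewritten along those lines (looping over an explicit range rather than A.domain, and using foreach to assert order independence):

proc sum() {
  var A : [0..255] int(32);
  var B : [0..255] int(32);

  // foreach tells the compiler the iterations are order-independent,
  // and the literal range avoids the indirection of A.domain
  foreach i in 0..255 do
    A[i] = B[i] + 1;
}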

Alternatively, you could make these programs more apples-to-apples by using Chapel homogeneous tuples, like var A, B: 256*int(32); which should result in virtually the same declarations as C.
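
A sketch of what that might look like (the proc name sumTuples is just for illustration; in recent Chapel versions tuples are 0-indexed, and homogeneous tuples can be indexed by a runtime integer):

proc sumTuples() {
  // homogeneous tuples are stack-allocated, much like the C arrays
  var A, B : 256*int(32);

  foreach i in 0..255 do
    A[i] = B[i] + 1;
}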

To my knowledge, when we use the LLVM back-end, the --ccflags flag has no effect since we're not invoking a C compiler to generate code, just a C/C++ linker (which is the clang++ line in your output above). If I'm correct in that, arguably chpl should issue a warning when --ccflags is specified, but not utilized. This is another place where it would be good to have someone like @mppf or @diten double-check me.

Though the LLVM back-end is the preferred choice these days, you may want to rebuild with CHPL_TARGET_COMPILER=clang, use the --savec <dir> flag to preserve the generated C code for inspection, and then use --ccflags to invoke clang in the same way as in your C compilation step. I'd be curious whether this gets you closer to the comparison that you're looking for.
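
A sketch of that workflow (assuming your installation supports switching to the C back-end; for a prebuilt package this may require rebuilding Chapel for that configuration):

# select the C back-end via clang (may require rebuilding the runtime)
export CHPL_TARGET_COMPILER=clang

# keep the generated C in ./gen and forward the remark flag to clang
chpl src/Poc.chpl --fast --savec gen --ccflags -Rpass=loop-vectorize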

how I can peek inside the black box. I want to see the effects of the code I am writing. I am happy if I can see that my loops are executed with any kind of SIMD instruction.

Unfortunately, for the preferred LLVM back-end, I don't believe we have a very good user-facing story here today. When our team has wanted to understand the code we're getting out of LLVM, I believe we've typically been disassembling the resulting binaries. For example, there's a capture of one of our devs' analysis of one of our Computer Language Benchmarks Game entries here, where he used llvm-objdump to look at the resulting assembly: https://gist.github.hpe.com/andrew-consroe/67a28ab5a52d0396d19d100ef8745caf
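
A sketch of that kind of check on the earlier example (the binary name is from the previous posts, and the grep pattern is just a couple of common AVX/AVX2 instructions):

# disassemble the whole binary, then scan for packed SIMD instructions
llvm-objdump -d ./Poc > Poc.dump
grep -E 'vpaddd|vpsubd|vmovdqu' Poc.dump | head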

I'll also mention that Michele Weiland and Ricardo Jesus at U. Edinburgh have started looking at Chapel vectorization and are giving a talk about their early experiences at CHIUW 2022 next week, so may have additional tips or techniques to suggest. Ultimately it would be nice to wire our compiler up to LLVM in such a way that it could generate additional information about how loop nests are implemented, similar to what the Cray CCE compiler does (or your clang-generated remarks above), but we're not there yet, unfortunately.

Best wishes,
-Brad

It was mentioned:

... note that the code generated for the Chapel arrays will be more complex than the C arrays
in that Chapel's arrays are heap-allocated and support richer indexing capabilities, such as
indices that may not be 0-based or may be multidimensional. While one might imagine that
we could optimize small, 0-based, 1D arrays to have the same implementation as C, we don't
do this today.

I have not done a truly rigorous test, but in the limited testing I did in a single application, an SVD, Chapel's indexing with 1-based 1D arrays was competitive with C's indexing with 0-based 1D arrays. The lengths ranged from 50 to 5000.

My 2c.

I can confirm that dense 1D arrays have the same performance characteristics as software written in C.

I went back to my two programs and disassembled them to Intel assembly (for the C program I used -S to get assembly directly, while for the Chapel program I did objdump -d target/release/Poc > Poc.dump). In both cases I found repeated assembly instructions such as vpsubd and vmovdqu. I don't know with 100% certainty what is going on inside this assembly, but I do know that SIMD instructions from the AVX or AVX2 instruction set are doing the work.

So far I have found a way to analyse my vector instructions with objdump. I can cut and paste the kernel I am focusing on into simple example programs and read the assembly. This way I can experiment with Chapel until I am pleased with the results. :slight_smile:

It is a bit cumbersome, and I would love to see simpler analysis techniques in the future.

I am looking forward to watching Michele Weiland and Ricardo Jesus's talk online after the event. I hope to learn something.

PS: I tried --savec <dir> also. However, the chpl__module-nopt.bc I got out of it was hard to analyse. I passed it to llvm-bcanalyzer, but I did not really understand much of what was going on from the results.


I really don't know about llvm-bcanalyzer. FWIW, I typically disassemble the bitcode file with llvm-dis and inspect the LLVM code.
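
For example, something like this (a sketch, using the .bc file name mentioned above):

# turn the LLVM bitcode into human-readable IR and page through it
llvm-dis chpl__module-nopt.bc -o chpl__module-nopt.ll
less chpl__module-nopt.ll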


I have never figured out how to isolate the assembler associated with a particular routine.

Are there any suggestions?

In C, I just isolate the routine and compile with '-S'.

I know I'm a bit late to this thread but I'll reply with some things that I know.

If you write a loop with foreach or forall, you are saying that the body of that loop can be run in any order.

The situation today with the production compiler is that the LLVM vectorizer (the same one used for C) is quite good at vectorizing whether or not we provide vectorization hints in the LLVM IR. So in fact I would not expect --vectorize or --no-vectorize to matter, but it is not something I have experimented with recently. The situation is different with our experimental integration with RV, which can vectorize outer loops and heads more towards vectorizing anything and everything that the Chapel compiler hints as vectorizable. We would like to get this RV work further along towards production, but lately we have been focusing on getting started with GPU support instead. I still expect we will get back to it at some point, though.

When using the LLVM backend, the --ccflags arguments are passed to an embedded run of clang that we use. So clang flags can be provided here, but they don't always work as expected, because we're not running clang in the usual way.

+1. It is a bit complicated because you might want to know if Chapel thought the loop was vectorizable, and you might also want to know if later LLVM passes actually vectorized it.

One other trick that I sometimes use is to run the program in the debugger and hit Control-C to pause it after it has had a chance to run for a bit. This makes it likely that you will end up in code that is running frequently (whatever code is taking most of the time is most likely where you have stopped). You can use bt to see where that is and do the experiment a few times to identify the places where it is spending most time. Then you can use commands like x/50i $rip-40 to print out the instructions around the current instruction pointer.
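
A sketch of that session with gdb (lldb has analogous commands; the binary name is from the earlier example):

gdb ./Poc
(gdb) run
^C                      # interrupt once it has been running for a bit
(gdb) bt                # which code are we in most of the time?
(gdb) x/50i $rip-40     # print the instructions around the current one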

When working with the LLVM backend there are actually 2 things that you might be interested in looking at in this kind of situation:

  1. The LLVM IR after optimization
  2. The machine code / instructions after optimization

For 1, you can use --llvm-print-ir <function-name> --llvm-print-ir-stage full to see the IR after optimization. E.g., for the module Poc and the sum proc that were posted earlier, I compiled with chpl poc.chpl --fast --llvm-print-ir sum --llvm-print-ir-stage full, and in the output I can see this bit:

  %87 = load <8 x i32>, <8 x i32>* %86, align 4, !tbaa !47, !alias.scope !26, !noalias !23
  %88 = add nsw <8 x i32> %78, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %89 = add nsw <8 x i32> %81, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %90 = add nsw <8 x i32> %84, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %91 = add nsw <8 x i32> %87, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %92 = bitcast i32* %75 to <8 x i32>*
  store <8 x i32> %88, <8 x i32>* %92, align 4, !tbaa !47, !alias.scope !23, !noalias !26

where the <8 x i32> type in LLVM IR is a vector type (it's just <width x element type>), and you can also see here that there must have been some loop unrolling, because there are four 8-way vector adds next to each other, so it is doing 32 increments per loop iteration.

One oddity here is that when using --llvm-print-ir, in order to be able to print out the function after optimization, the compiler requires it to definitely not be inlined. That is unlikely to matter for performance in this case, but it could in another setting. When you are trying to understand loop vectorization, I would expect things to make sense if you use --llvm-print-ir on a function containing a loop (rather than, say, a function called within the loop).

For 2, I don't currently have a better answer than dumping the instructions from gdb or from objdump. Of course it would be possible, with scripting and/or careful use of objdump flags, to get out just the instructions for a particular function. I think it would be interesting to add a flag to chpl to do this in a way similar to --llvm-print-ir, but for the machine code. We don't have that yet, and I would welcome a contribution in this area if anybody wanted to work out the objdump commands / scripting the compiler would run to dump a symbol with a particular name.
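
As a rough starting point for anyone wanting to experiment, something like the following might work (sum_chpl is a placeholder for whatever mangled name nm actually reports; --disassemble= needs GNU binutils 2.32+, and --disassemble-symbols= a reasonably recent llvm-objdump):

# find the mangled symbol name of the routine of interest
nm ./Poc | grep -i sum

# GNU objdump: disassemble just that symbol
objdump --disassemble=sum_chpl ./Poc

# llvm-objdump equivalent
llvm-objdump -d --disassemble-symbols=sum_chpl ./Poc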
