I think that cleared it up.
It is a good starting point for learning more about iterators, and more specifically about when forall will spawn new tasks. I can figure this part out.
The thing I have a hard time with is the --vectorize flag in the Chapel compiler, and how it relates to foreach loops. I don't think the challenge here is the theory, but rather how I can peek inside the black box. I want to see the effects of the code I am writing. I am happy if I can see that my loops are done with any kind of SIMD instructions.
So to get to know this I ran the simplest experiment: I tried to write the same program in Chapel and in C, the simplest program I could imagine.
First I wrote a program in C. I verified that Clang on my Mac does a decent job unrolling and vectorising it.
My program looks like this:
#include <stdio.h>

void sum() {
    int A[256];
    int B[256];
    for (int i = 0; i < 256; i++) {
        A[i] = B[i] + 1;
    }
    printf("%d", A[0]);
}

int main() {
    sum();
    return 0;
}
I compiled with clang -O3 -march=native -Rpass=loop-vectorize -S Poc.c and got an assembly file along with this compiler output.
Poc.c:6:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
for ( int i = 0; i < 256; i++ )
^
Poc.c:6:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
Clang tells me that this loop was vectorised. This is great feedback. With a vectorization width of 8, I can imagine this loop running up to 8 times faster than a scalar loop.
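To make sense of the numbers in the remark, here is a back-of-envelope sketch. It assumes AVX2-class hardware with 256-bit vector registers, which is what -march=native typically enables on a recent Intel Mac:

```shell
# Lanes per vector instruction: a 256-bit register holds 256/32 = 8
# 32-bit ints, matching the "vectorization width: 8" in the remark.
echo "width=$((256 / 32))"
# "interleaved count: 4" means four independent vector operations are
# kept in flight, so each trip of the vectorized loop covers
# 8 * 4 = 32 of the 256 elements.
echo "elements_per_iteration=$((256 / 32 * 4))"
```

So the 8x figure is an upper bound from the lane count alone; the interleaving mainly helps hide instruction latency rather than multiplying throughput further.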
Then I tried my best to do the same in Chapel.
/* Documentation for Poc */
module Poc {
  writeln("New library: Poc");

  proc sum() {
    var A : [0..255] int(32);
    var B : [0..255] int(32);
    for i in A.domain do
      A[i] = B[i] + 1;
  }

  proc main() {
    sum();
  }
}
I assumed I could pass the same flags to clang under the hood of the Chapel compiler. So I compiled with

chpl src/Poc.chpl --fast --target-compiler=llvm --ccflags -Rpass=loop-vectorize --print-commands --explain-verbose

The important parts are how I explicitly state --target-compiler=llvm to use the LLVM backend, and --ccflags -Rpass=loop-vectorize to get remarks about vectorized loops.
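Another route I might try is asking chpl to dump the LLVM IR for a single function. This is only a sketch: I am assuming the --llvm-print-ir and --llvm-print-ir-stage flags described in the Chapel LLVM documentation behave like this, and that the sum procedure is visible under that name:

```shell
# Sketch (assumed flags from the Chapel LLVM technote): print the
# fully optimized IR for sum; vectorized code would show vector
# types such as <8 x i32> in the loop body.
chpl src/Poc.chpl --fast --llvm-print-ir sum --llvm-print-ir-stage full
```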
The output does not help me much. I can't see what is going on. This loop is so simple that it should be trivial to run faster than sequential code; it almost asks for optimisation. For all I know it might already be optimal. Unfortunately I cannot see how the compiler treats this loop. My suspicion is that flags passed through --ccflags only reach the embedded clang invocation, while the loop optimisations happen inside chpl's own LLVM pipeline, so the remarks never fire for my loop.
What chpl wrote to stdout is listed below.
<internal clang code generation> -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/standard -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/packages -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/localeModels/flat -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/localeModels -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/comm/none -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/comm -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/tasks/qthreads -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/qio -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/atomics/cstdlib -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/mem/jemalloc -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/utf8-decoder -DCHPL_JEMALLOC_PREFIX=chpl_je_ -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/jemalloc/install/target/darwin-x86_64-native-llvm-none/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/include -I. -I/var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l -O3 -DCHPL_OPTIMIZE -march=native -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/internal -Rpass=loop-vectorize -DCHPL_GEN_CODE -pthread -I/usr/local/include -include sys_basic.h -include ctype.h -include wctype.h -include llvm/chapel_libc_wrapper.h
# Make Binary - Linking
/usr/local/Cellar/llvm/13.0.1_1/bin/clang++ /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/chpl__module.o /usr/local/Cellar/chapel/1.26.0/libexec/lib/darwin/llvm/x86_64/cpu-native/loc-flat/comm-none/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none/main.o -o /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/Poc.tmp -L/usr/local/Cellar/chapel/1.26.0/libexec/lib/darwin/llvm/x86_64/cpu-native/loc-flat/comm-none/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none -lchpl -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -lhwloc -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/lib -lqthread -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -lhwloc -lchpl -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/jemalloc/install/target/darwin-x86_64-native-llvm-none/lib -ljemalloc -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/lib -lre2 -lgmp -lm -lpthread -L/usr/local/lib
rm -f Poc
mv /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/Poc.tmp Poc
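Since the remarks are silent, one fallback is to disassemble the final binary and look for SIMD instructions directly. A sketch, assuming a macOS toolchain (otool ships with Xcode) and AVX2 hardware:

```shell
# Disassemble the Poc binary produced above and search for packed
# 32-bit integer adds on 256-bit ymm registers; vpaddd is the AVX2
# instruction clang emits for int32 vector addition.
otool -tv Poc | grep -E 'vpaddd|ymm'
```

Any hits on ymm registers in the hot loop would be the visible evidence I am after, without relying on compiler diagnostics at all.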
For the curious readers