I think that cleared it up.
It is a good starting point for learning more about iterators, and more specifically about when forall will spawn new tasks. I can figure this part out.
The thing I have a hard time with is the --vectorize flag in the Chapel compiler, and how it relates to foreach loops. I don't think the challenge here is the theory, but rather how I can peek inside the black box. I want to see the effects of the code I am writing. I am happy if I can see that my loops are done with any kind of SIMD instructions.
So to get to know this I ran the simplest experiment: I tried to write the same program in Chapel and in C, the simplest program I could imagine.
First I wrote a program in C. I verified that Clang on my Mac does a decent job unrolling and vectorising it.
My program looks like this:
#include <stdio.h>

void sum() {
    int A[256];
    int B[256];
    for (int i = 0; i < 256; i++) {
        A[i] = B[i] + 1;
    }
    printf("%d", A[0]);
}

int main() {
    sum();
    return 0;
}
I compiled with clang -O3 -march=native -Rpass=loop-vectorize -S Poc.c and got an assembly file along with this compiler output.
Poc.c:6:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
for ( int i = 0; i < 256; i++ )
^
Poc.c:6:5: remark: vectorized loop (vectorization width: 8, interleaved count: 4) [-Rpass=loop-vectorize]
Clang tells me that this loop was vectorised. This is great feedback. With a vectorization width of 8, I can imagine this loop running up to 8 times faster than a scalar loop.
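To make sense of the numbers in the remark, here is a back-of-envelope sketch. It assumes AVX2-class hardware with 256-bit vector registers, which is what -march=native typically enables on a recent Intel Mac:

```shell
# Lanes per vector instruction: a 256-bit register holds 256/32 = 8
# 32-bit ints, matching the "vectorization width: 8" in the remark.
echo "width=$((256 / 32))"
# "interleaved count: 4" means four independent vector operations are
# kept in flight, so each trip of the vectorized loop covers
# 8 * 4 = 32 of the 256 elements.
echo "elements_per_iteration=$((256 / 32 * 4))"
```

So the 8x figure is an upper bound from the lane count alone; the interleaving mainly helps hide instruction latency rather than multiplying throughput further.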
Then I tried my best to do the same in Chapel.
/* Documentation for Poc */
module Poc {
  writeln("New library: Poc");

  proc sum() {
    var A : [0..255] int(32);
    var B : [0..255] int(32);
    for i in A.domain do
      A[i] = B[i] + 1;
  }

  proc main() {
    sum();
  }
}
I assumed I could pass the same flags to clang under the hood of the Chapel compiler. So I compiled with

chpl src/Poc.chpl --fast --target-compiler=llvm --ccflags -Rpass=loop-vectorize --print-commands --explain-verbose

The important parts are how I explicitly state --target-compiler=llvm to use the LLVM backend, and --ccflags -Rpass=loop-vectorize to get remarks about vectorized loops.
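Another route I might try is asking chpl to dump the LLVM IR for a single function. This is only a sketch: I am assuming the --llvm-print-ir and --llvm-print-ir-stage flags described in the Chapel LLVM documentation behave like this, and that the sum procedure is visible under that name:

```shell
# Sketch (assumed flags from the Chapel LLVM technote): print the
# fully optimized IR for sum; vectorized code would show vector
# types such as <8 x i32> in the loop body.
chpl src/Poc.chpl --fast --llvm-print-ir sum --llvm-print-ir-stage full
```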
The output does not help me much. I can't see what is going on. This loop is so simple that it should be trivial to run faster than sequential code; it almost asks for optimisation. For all I know it might already be optimal. Unfortunately I cannot see how the compiler treats this loop. My suspicion is that flags passed through --ccflags only reach the embedded clang invocation, while the loop optimisations happen inside chpl's own LLVM pipeline, so the remarks never fire for my loop.
What chpl wrote to stdout is listed below.
<internal clang code generation> -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/standard -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/packages -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/localeModels/flat -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/localeModels -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/comm/none -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/comm -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/tasks/qthreads -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/qio -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/atomics/cstdlib -I/usr/local/Cellar/chapel/1.26.0/libexec/runtime/include/mem/jemalloc -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/utf8-decoder -DCHPL_JEMALLOC_PREFIX=chpl_je_ -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/jemalloc/install/target/darwin-x86_64-native-llvm-none/include -I/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/include -I. -I/var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l -O3 -DCHPL_OPTIMIZE -march=native -I/usr/local/Cellar/chapel/1.26.0/libexec/modules/internal -Rpass=loop-vectorize -DCHPL_GEN_CODE -pthread -I/usr/local/include -include sys_basic.h -include ctype.h -include wctype.h -include llvm/chapel_libc_wrapper.h
# Make Binary - Linking
/usr/local/Cellar/llvm/13.0.1_1/bin/clang++ /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/chpl__module.o /usr/local/Cellar/chapel/1.26.0/libexec/lib/darwin/llvm/x86_64/cpu-native/loc-flat/comm-none/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none/main.o -o /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/Poc.tmp -L/usr/local/Cellar/chapel/1.26.0/libexec/lib/darwin/llvm/x86_64/cpu-native/loc-flat/comm-none/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none -lchpl -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -lhwloc -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/qthread/install/darwin-x86_64-native-llvm-none-flat-jemalloc-bundled/lib -lqthread -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/hwloc/install/darwin-x86_64-native-llvm-none-flat/lib -lhwloc -lchpl -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/jemalloc/install/target/darwin-x86_64-native-llvm-none/lib -ljemalloc -L/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/lib -Wl,-rpath,/usr/local/Cellar/chapel/1.26.0/libexec/third-party/re2/install/darwin-x86_64-native-llvm-none/lib -lre2 -lgmp -lm -lpthread -L/usr/local/lib
rm -f Poc
mv /var/folders/1z/crkhksz13zqbz175phk0s8lh0000gn/T//chpl-andreas.dreyer.hysing.deleteme-Np3e4l/Poc.tmp Poc
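Since the remarks are silent, one fallback is to disassemble the final binary and look for SIMD instructions directly. A sketch, assuming a macOS toolchain (otool ships with Xcode) and AVX2 hardware:

```shell
# Disassemble the Poc binary produced above and search for packed
# 32-bit integer adds on 256-bit ymm registers; vpaddd is the AVX2
# instruction clang emits for int32 vector addition.
otool -tv Poc | grep -E 'vpaddd|ymm'
```

Any hits on ymm registers in the hot loop would be the visible evidence I am after, without relying on compiler diagnostics at all.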
For the curious readers