I'm glad this issue is resolved for you and that Chapel beats out Fortran now!
As follow-ups, I've opened two issues. The second one is specifically about improving the performance without needing to write an explicit forall.
- [Feature Request]: Support a `--vector-library` compilation flag for Chapel · Issue #27094 · chapel-lang/chapel · GitHub
- Can/should the Chapel compiler do loop fusing of parallel loops? · Issue #27076 · chapel-lang/chapel · GitHub
Brad has also opened an issue about the threading default: Is defaulting the number of threads to #cores vs. #hyperthreads still the right choice? (MAX_PHYSICAL vs. MAX_LOGICAL) · Issue #27099 · chapel-lang/chapel · GitHub
-Jade