External Issue: [Bug]: jemalloc issues with lots of cores (HPE Superdome)

24736, "hpcpony", "[Bug]: jemalloc issues with lots of cores (HPE Superdome)", "2024-03-31T18:14:50Z"

I'm seeing problems with Chapel/jemalloc on a machine with lots of cores (HPE Superdome with 1568 cores, including hyperthreading). I know this is kind of an unusual case, but I thought I'd mention it.

[host7:Chapel] uname -a
Linux host7 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 15 07:18:13 EST 2024 x86_64 x86_64 x86_64 GNU/Linux

[host7:Chapel] /opt/CHAPEL/chapel-2.0.0_host7/util/printchplenv
machine info: Linux host7 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 15 07:18:13 EST 2024 x86_64
CHPL_HOME: /opt/CHAPEL/chapel-2.0.0_host7 *
script location: /opt/CHAPEL/chapel-2.0.0_host7/util/chplenv
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native +
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet +
  CHPL_COMM_SUBSTRATE: smp +
  CHPL_GASNET_SEGMENT: fast +
CHPL_TASKS: qthreads
CHPL_LAUNCHER: smp
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled +
CHPL_HWLOC: bundled
CHPL_RE2: bundled +
CHPL_LLVM: bundled +
CHPL_AUX_FILESYS: none

[host7:Chapel] chpl --version
<jemalloc>: Reducing narenas to limit (4094)
chpl version 2.0.0
  built with LLVM version 17.0.6
  available LLVM targets: x86-64, x86
Copyright 2020-2024 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)

Test case:

[host7:Chapel] cat x.chpl
writeln("Hello");

[host7:Chapel] chpl x.chpl
<jemalloc>: Reducing narenas to limit (4094)
<jemalloc>: Reducing narenas to limit (4094)
<jemalloc>: Reducing narenas to limit (4094)

[host7:Chapel] ./x -nl 2
<jemalloc>: Reducing narenas to limit (4094)
internal error: could not change current thread's arena
internal error: could not change current thread's arena

Poking around, it looks like there are a number of ways to limit the number of arenas. I'm not sure setting it explicitly is any better than just letting jemalloc reduce it to 4094, but at least it's explicit.

[host7:Chapel] export MALLOC_CONF='narenas:2048'
[host7:Chapel] chpl x.chpl

[host7:Chapel] ./x -nl 2
<jemalloc>: Reducing narenas to limit (4094)
internal error: could not change current thread's arena
internal error: could not change current thread's arena

Problem 1 is that jemalloc doesn't seem to know how to deal with this many cores. I think the default is to create narenas = 4 x cores (6272 in my case), but it looks like there are only 12 bits available internally for the arena index.

.../third-party/jemalloc/jemalloc-src/include/jemalloc/internal/jemalloc_internal.h.in:#define MALLOCX_ARENA_MAX 0xffe
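
As a rough check of the arithmetic above, here is a minimal shell sketch. It assumes the default really is 4 x cores (my guess, not confirmed) and takes the limit from the #define quoted above; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: compare the assumed jemalloc default (4 x cores) with the
# MALLOCX_ARENA_MAX value quoted above. Helper names are hypothetical.

ARENA_MAX=4094   # 0xffe, from the #define above

# default_narenas CORES: prints 4 x CORES (the assumed default).
default_narenas() {
    echo $(( 4 * $1 ))
}

# exceeds_limit CORES: exit status 0 if the assumed default is above the limit.
exceeds_limit() {
    [ "$(default_narenas "$1")" -gt "$ARENA_MAX" ]
}

cores=1568   # core count on this machine, including hyperthreading
echo "assumed default narenas: $(default_narenas "$cores")"
echo "arena limit: $ARENA_MAX"
if exceeds_limit "$cores"; then
    echo "default exceeds the limit, so jemalloc has to reduce it"
fi
```

Under that assumption, any machine with 1024 or more hardware threads would go over the limit.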

Poking around in jemalloc 5.3, I think it still has this limitation.

You can apparently reduce the number of arenas (as I did above), but I'm unclear whether that's a reasonable thing to do.
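
One way to set the value without hard-coding it might look like the sketch below. It assumes the 4 x cores default and the 4094 limit discussed above, and the helper name is hypothetical. It only builds the MALLOC_CONF string; as the run above suggests, the Chapel runtime may not pick that variable up.

```shell
#!/bin/sh
# Sketch: derive an explicit narenas setting, capped at a given limit.
# capped_narenas is a hypothetical helper, not part of jemalloc or Chapel.

# capped_narenas CORES LIMIT: prints min(4 x CORES, LIMIT).
capped_narenas() {
    want=$(( 4 * $1 ))
    if [ "$want" -gt "$2" ]; then
        echo "$2"
    else
        echo "$want"
    fi
}

# Number of online processors; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

MALLOC_CONF="narenas:$(capped_narenas "$cores" 4094)"
export MALLOC_CONF
echo "MALLOC_CONF=$MALLOC_CONF"
```

Whether capping at the limit is preferable to a smaller value such as the 2048 I tried is an open question; this only makes the choice explicit.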

Problem 2 is that if you do reduce the number of arenas, it appears to fix jemalloc's complaint, but there's still something in the Chapel runtime that doesn't work quite right. Interestingly, if I build a Chapel application on a different machine (using a Chapel compiler built on that other machine), I can run it on the machine with lots of cores with no "reducing" and no "internal error:".

We're still mostly experimenting with Chapel, so prioritize as appropriate.