Undefined reference to jemalloc

The following build on an InfiniBand cluster:

wget https://github.com/chapel-lang/chapel/releases/download/1.31.0/chapel-1.31.0.tar.gz
tar xvfz chapel-1.31.0.tar.gz
source chapel-1.31.0/util/setchplenv.bash
export CHPL_COMM=gasnet
export CHPL_LLVM=none
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_LAUNCHER=gasnetrun_ibv
export CHPL_GASNET_SEGMENT=large
export CHPL_TARGET_CPU=native
export CHPL_TASKS=qthreads
export CHPL_MEM=jemalloc
export CHPL_HOST_JEMALLOC=bundled
cd $CHPL_HOME
mkdir ~/chapel-1.31.0
./configure --chpl-home=$HOME/chapel-1.31.0
make -j4
make install
make check

results in errors at the last step:

_main.o:_main.c:function chpl_calloc: error: undefined reference to 'chpl_je_mallocx'
_main.o:_main.c:function chpl_malloc: error: undefined reference to 'chpl_je_mallocx'
_main.o:_main.c:function chpl_memalign: error: undefined reference to 'chpl_je_mallocx'
_main.o:_main.c:function chpl_realloc: error: undefined reference to 'chpl_je_mallocx'
_main.o:_main.c:function chpl_realloc: error: undefined reference to 'chpl_je_dallocx'
(many more lines)

Is there a missing jemalloc path variable of some sort in my build?

Thank you,

Alex.

Hi Alex —

This error mode isn't familiar to me, but we'll try to figure out what's going wrong. The jemalloc library is bundled with Chapel and should have been built and installed as part of your previous make commands. I assume there were no previous errors reported in your build?

Here are some things I'd try to get more information on what's going wrong, where I'm assuming a certain familiarity with C compiler -L/-l flags, but let me know if you need more information:

  • From your ~/chapel-1.31.0 directory, if you use the copy of chpl stored under bin (within a very platform-specific subdirectory) to compile an arbitrary Chapel program (like hello.chpl), do you get the same errors? I think you should/will, but if not, that suggests the problem is more likely to be something in the make check command than in the installation itself.

  • If you re-run that compilation using the --print-commands flag, do you see mentions of -L/-l flags involving jemalloc in the # Make Binary - Linking stage? For example, on my system, I can do:

    bin/linux64-x86_64/chpl $CHPL_HOME/examples/hello.chpl --print-commands
    

    and under:

    # Make Binary - Linking
    

    see the library-related flags:

    -L/users/bradc/chapel-install/third-party/jemalloc/install/target/linux64-x86_64-none-llvm-none/lib -ljemalloc
    
  • Next, do the directory and library in question exist? For example, in my install, if I go to the directory above, and do ls, I see:

    libjemalloc.a*	libjemalloc_pic.a*
    
  • Next, does the library being requested by the -l option define the missing symbols? For example, in my case, I do:

    nm libjemalloc.a | grep chpl_je_mallocx
    

    and get:

    0000000000003780 T chpl_je_mallocx
    

    suggesting that it does.

  • Finally, if you try these similar commands from the directory where you initially built Chapel rather than the one into which you installed it, do you get different results?
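
The symbol check above can be made a little stricter than a plain grep. This is a minimal sketch, assuming the usual `nm` output layout (address, type letter, name); `defines_symbol` is a hypothetical helper name, not part of Chapel. It only succeeds when the symbol is globally defined (type T or W), so an undefined reference (U) or a local symbol (t) does not count as a match.

```shell
# Reads `nm` output on stdin and succeeds only if SYMBOL appears as a global
# text symbol (T) or a weak symbol (W), i.e. something the linker can use to
# satisfy an undefined reference from another object file.
defines_symbol() {
  awk -v sym="$1" '
    $NF == sym && $(NF-1) ~ /^[TW]$/ { found = 1 }
    END { exit !found }
  '
}

# Example use (the archive path is whatever your install uses):
#   nm libjemalloc.a | defines_symbol chpl_je_mallocx && echo "defined"
```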

Thanks for any additional information along these lines,
-Brad

Hi Brad,

Thank you for getting back to me so quickly!

Yes, the installed executable ~/chapel-1.31.0/bin/linux64-x86_64/chpl fails to compile even the simple examples/hello.chpl code, with the same errors, so it's not limited to make check.

Recompiling with --savec tmp --print-commands, it breaks in the # compiling generated source section. The exact offending command is:

mpicxx -o tmp/hello.tmp tmp/_main.o /home/razoumov/chapel-1.31.0/lib/linux64/gnu/x86_64/cpu-native/loc-flat/comm-gasnet/ibv/large/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none/main.o -L/home/razoumov/chapel-1.31.0/lib/linux64/gnu/x86_64/cpu-native/loc-flat/comm-gasnet/ibv/large/tasks-qthreads/tmr-generic/unwind-none/mem-jemalloc/atomics-cstdlib/hwloc-bundled/re2-bundled/fs-none/lib_pic-none/san-none -lchpl -L/home/razoumov/chapel-1.31.0/third-party/gmp/install/linux64-x86_64-native-gnu-none/lib -lgmp -Wl,-rpath,/home/razoumov/chapel-1.31.0/third-party/gmp/install/linux64-x86_64-native-gnu-none/lib -L/home/razoumov/chapel-1.31.0/third-party/hwloc/install/linux64-x86_64-native-gnu-none-flat/lib -lhwloc -Wl,-rpath,/home/razoumov/chapel-1.31.0/third-party/hwloc/install/linux64-x86_64-native-gnu-none-flat/lib -D_GNU_SOURCE=1 --param max-inline-insns-single=35000 --param inline-unit-growth=10000 --param large-function-growth=200000 -L/home/razoumov/chapel-1.31.0/third-party/gasnet/install/linux64-x86_64-native-gnu-none/substrate-ibv/seg-large/lib -lgasnet-ibv-par -L/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64 -libverbs -L/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/gcccore/9.3.0/lib/gcc/x86_64-pc-linux-gnu/9.3.0 -lgcc -L/home/razoumov/chapel-1.31.0/third-party/qthread/install/linux64-x86_64-native-gnu-none-flat-jemalloc-bundled/lib -Wl,-rpath,/home/razoumov/chapel-1.31.0/third-party/qthread/install/linux64-x86_64-native-gnu-none-flat-jemalloc-bundled/lib -lqthread -lchpl -L/home/razoumov/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib -ljemalloc -L/home/razoumov/chapel-1.31.0/third-party/re2/install/linux64-x86_64-native-gnu-none/lib -lre2 -Wl,-rpath,/home/razoumov/chapel-1.31.0/third-party/re2/install/linux64-x86_64-native-gnu-none/lib -lnuma -lm -lpthread

There is -L/home/razoumov/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib -ljemalloc in there, and if I move it closer to the beginning of the command, the line works. In fact, the offending sequence is:

-L/cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64 -libverbs \
-L/home/razoumov/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib -ljemalloc \

If I reverse the order of the two lines, the code compiles fine. So the collision appears to involve the directory that provides InfiniBand's RDMA library.

There is no problem with libjemalloc.a itself; it defines the needed symbols:

$ nm ~/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib/libjemalloc.a | grep chpl_je_mallocx
0000000000007a40 T chpl_je_mallocx
0000000000000046 t chpl_je_mallocx.cold

From the build directory I get the same errors:

$ pwd
/tmp/razoumov/chapel-1.31.0
$ which chpl
/tmp/razoumov/chapel-1.31.0/bin/linux64-x86_64/chpl
$ chpl ./examples/hello.chpl -o hello
/tmp/chpl-razoumov.deleteme-6ATbbJ/_main.o:_main.c:function chpl_calloc: error: undefined reference to 'chpl_je_mallocx'
/tmp/chpl-razoumov.deleteme-6ATbbJ/_main.o:_main.c:function chpl_malloc: error: undefined reference to 'chpl_je_mallocx'
...

Thank you,

Alex.

Hi Alex —

Oh, that's interesting, and I'm glad you were able to determine the source of the issue and a possible workaround... This issue doesn't sound at all familiar to me, and I think it would be worth opening a GitHub issue for it on our repository.

I'm tagging @ronawho on this, who is most familiar with our use of jemalloc (and has a better memory than mine) to see if he's run into this before.

To make sure I'm understanding, is the hypothesis that there's some conflict between the IB RDMA library and our jemalloc library that causes the RDMA library to be loaded when it's earlier in the path, and prevents jemalloc from getting loaded due to the conflict? And if so, do you happen to know what the conflict is?

Thanks again,
-Brad

Hi Brad,

Yes, that's precisely the issue. I don't know what the conflict is, as there do not seem to be any common names defined in both libibverbs.so and libjemalloc.a.

BTW this change happened in Chapel between v1.25.0 (still compiles fine for me) and v1.26.0.

I could open an issue in the GitHub repository, but I don't have any workaround, except for suggesting to move -L/path/to/lib -ljemalloc earlier in the line. I scanned the Chapel source code but could not find where to edit and test this.

Thank you,

Alex.

Hi Alex —

Elliot asked me something that seems obvious and more likely than my conflicting symbols theory in retrospect: By any chance is there a libjemalloc.* in your /cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64 path?

If not, I'm wondering whether you could look into whether there are "verbose"/"trace"-style flags we could pass to the linker to have it tell us more about what it's doing, though I suspect the details will depend on what linker/compiler mpicxx is wrapping in your environment. I don't mess with the linker very often, but this SO post looked helpful.

The timing you note between 1.25 and 1.26 suggests the following PR, which Elliot remembered reordered the -L flags that we use: https://github.com/chapel-lang/chapel/pull/18880 What I can't recall (and am not quickly finding the answer to in reading the PR) is why the system ones would've been reordered to precede the Chapel ones. Tagging @mppf who developed it to see if he recalls.

If the issue does turn out to be one of getting a system libjemalloc (and there's a good reason for that system jemalloc to be there), a potential fix would be for us to rename our bundled libraries to something more unique, similar to what was suggested here: problems with -I and -L ordering · Issue #18840 · chapel-lang/chapel · GitHub

Don't feel like you need to have a proposed workaround or fix to open the GitHub issue. The benefit to doing so is that it'd be part of the permanent record of work to do (and potentially more visible to people running into a similar issue), whereas this thread will get lost in our inboxes at some point.

When I wrote my previous response, I was mildly optimistic that adding a -L flag to the chpl command-line would cause it to precede the system -Ls, but unfortunately, it seems to get added at the end... Currently checking to see whether I can come up with another short-term workaround...

-Brad

Alex —

Not very satisfying, but I believe the following mod to the compiler code will swap the order of the -L flags and may resolve the issue. But I'm not suggesting that this is a reasonable long-term approach, and it seems likely that it could lead to other issues.

My real hope is that there is a second libjemalloc on your system that shouldn't be there, and that removing it will resolve the issue.

-Brad

diff --git a/compiler/llvm/clangUtil.cpp b/compiler/llvm/clangUtil.cpp
index 57ea702c92..64d479b814 100644
--- a/compiler/llvm/clangUtil.cpp
+++ b/compiler/llvm/clangUtil.cpp
@@ -5361,11 +5361,6 @@ static std::string buildLLVMLinkCommand(std::string useLinkCXX,
     command += dotOFiles[i];
   }
 
-  for (size_t i = 0; i < clangLDArgs.size(); ++i) {
-    command += " ";
-    command += clangLDArgs[i];
-  }
-
   // Put user-requested libraries at the end of the compile line,
   // they should at least be after the .o files and should be in
   // order where libraries depend on libraries to their right.
@@ -5379,6 +5374,11 @@ static std::string buildLLVMLinkCommand(std::string useLinkCXX,
     command += libName;
   }
 
+  for (size_t i = 0; i < clangLDArgs.size(); ++i) {
+    command += " ";
+    command += clangLDArgs[i];
+  }
+
   return command;
 }

Right, PR 18880 was intended to solve this kind of issue (when we have an incompatible jemalloc and are using the bundled jemalloc, say). Of course, it's possible there is a bug in it somewhere. We have very limited testing for the cases where there is both a system-wide install and a Chapel bundled install of a dependency that are incompatible.

why the system ones would've been reordered to precede the Chapel ones.

I would think we would put the bundled ones first and then the system ones?

That's what it seems to do here:

Of course, there might be some issue causing the reverse to happen in this case.

Hi Brad,

Ah, yes, there is /cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libjemalloc.so on the system, so the linker is looking for chpl_je_mallocx and the rest in the wrong library.
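
The effect of the -L ordering can be reproduced without running the linker at all. This is a minimal sketch, assuming the usual GNU ld/gold search rule (directories are tried in command-line order, and within a directory a shared lib<name>.so is preferred over a static lib<name>.a); `resolve_lib` is a hypothetical helper, not a Chapel or linker tool.

```shell
# Print the file that "-l<name>" would resolve to, given the -L directories
# in the order they appear on the link line. Returns non-zero if no
# directory provides the library.
resolve_lib() {
  name=$1
  shift
  for dir in "$@"; do
    # Within one directory, the shared library is tried before the archive.
    for ext in so a; do
      if [ -e "$dir/lib$name.$ext" ]; then
        printf '%s\n' "$dir/lib$name.$ext"
        return 0
      fi
    done
  done
  return 1
}

# Example use with the two directories from this thread:
#   resolve_lib jemalloc \
#     /cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64 \
#     "$HOME/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib"
```

With the system directory listed first, the system libjemalloc.so is the one reported, which does not define the chpl_je_-prefixed symbols; swapping the two arguments reports the bundled libjemalloc.a instead.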

I tried applying the patch you suggested yesterday that moved

for (size_t i = 0; i < clangLDArgs.size(); ++i) {
  command += " ";
  command += clangLDArgs[i];
}

towards the end of buildLLVMLinkCommand, and recompiled/reinstalled Chapel from scratch, but as far as I can tell that did not modify the linking order: -L/path/to/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib -ljemalloc still appears towards the end of the linking command.

I will open an issue in GitHub.

Thank you,

I opened the issue libjemalloc library conflict on InfiniBand systems · Issue #23282 · chapel-lang/chapel · GitHub

Thank you,

Alex.

Hi Alex —

First, thanks for filing the GitHub issue. Is this a critical / blocking issue for you?

Hmm, I must've been overly optimistic. I expect that if you were to apply the patch and also use the -L flag with chpl to explicitly specify the bundled third-party path, it will probably work though. That is:

  • apply patch
  • rebuild chpl
  • compile with chpl -L/home/razoumov/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib myProgram.chpl

Rather than supplying that long flag each time, you should be able to set CHPL_LIB_PATH to get the same effect.

@mppf: Good point about the -L ordering, and I do seem to see the bundled ones precede the system ones in my own compiles... I had just been assuming the incorrect ordering based on the reported ordering with system paths first above. I'm only now noticing that this is with the C back-end (CHPL_LLVM=none) and a quick check suggests that the C back-end still uses the Makefile-based link line, so they must define the -L paths in different orders. I'll look a bit more into that.

-Brad

Alex —

Aha, and if I'm right, I think that explains why my workaround is wrong and suggests a much simpler workaround, which is to:

  • not patch or rebuild your compiler (or, back out the patch and rebuild)
  • export GEN_LFLAGS=-L/home/razoumov/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib in your environment (note: no space after the =)
  • re-compile

I believe this will put the bundled version of the jemalloc path first in the linker line, causing it to win out over the system path until we can get to the root cause of the issue.
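
For reference, the GEN_LFLAGS setting can be written as a small snippet for a shell profile or module file. This is a sketch under assumptions: the install prefix and the platform-specific subdirectory are the ones from this thread and will differ on other systems, and `jemalloc_lib` is just an illustrative variable name.

```shell
# Assumption: install prefix and platform subdirectory as reported in this
# thread; adjust both to match your own install.
jemalloc_lib="$HOME/chapel-1.31.0/third-party/jemalloc/install/target/linux64-x86_64-native-gnu-none/lib"

# No space after '='. Any flags already present in GEN_LFLAGS are kept,
# placed after the bundled jemalloc path.
export GEN_LFLAGS="-L$jemalloc_lib${GEN_LFLAGS:+ $GEN_LFLAGS}"
```

After setting this, re-running a compile with --print-commands should show whether the bundled path now appears ahead of the system one in the link line.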

-Brad

Hi Brad,

Your last suggestion (no patch, GEN_LFLAGS) actually worked -- thank you very much!

I'll be installing Chapel 1.31.0 centrally on the cluster for all users, and this fix will be easy to integrate into our CVMFS environment.

Thank you again,

Alex.
