[1.29.0] Infinite Looping in splitInitMissingTypeError

I'm still working on a reproducer and on getting chpl --gdb to work (something about the runtime not being built with debugging support, despite the DEBUG=1 make flag), but I've run into a situation that sends the 1.29.0 compiler into an infinite loop somewhere inside splitInitMissingTypeError.

I was working on casting an instance of an unmanaged class into/out of a C pointer (to use as an opaque handle type). The hang started happening when I added the unmanaged keyword to this line:

var castData = handle.data : c_ptr(unmanaged CSR_type);

But that pattern alone has not been enough to trigger it in a stripped-down reproducer. I will keep digging after some meetings and reply back.

The stack trace is from gdb --args chpl... because chpl --gdb still gives:

error: The runtime has not been built for this configuration. Run $CHPL_HOME/util/chplenv/printchplbuilds.py for information on available runtimes.

Hi @psath -

Sorry for the trouble here. My response is in two parts, covering two different strategies for resolving the problem.

First, I have used a similar pattern before and I do have a suggestion.

You wrote:

var castData = handle.data : c_ptr(unmanaged CSR_type);

But usually what I want to do is take a c_void_ptr and turn it into an unmanaged class type:

var castData = handle.data : unmanaged CSR_type?;

Here is a complete example that compiles and runs for me:

use CTypes;

class MyClass { var field: int; }

proc main() {
  var x = new unmanaged MyClass();
  var ptr = x : c_void_ptr;
  writeln(ptr); // write'ing a c_void_ptr prints the address
  // now cast back to unmanaged
  var y = ptr : unmanaged MyClass?; // casts to nilable unmanaged
                                    // since a c_void_ptr can be nil/null
  writeln(y : c_void_ptr);
}

Second, I have some thoughts about helping us track down the compiler bug. It would be very helpful if you are able to create a reproducer. The stack trace you posted doesn't look like an infinite loop to me, but it does look like the compiler is trying to emit an error. If you are interested in working with gdb, you can see what it is trying to report an error about by inspecting, e.g., frame 2 (splitInitMissingTypeError).

Here is an example:

aa.chpl

module M {
  proc main() {
    var x;
  }
}
$ chpl aa.chpl  --gdb
Reading symbols from chpl...
Breakpoint 1 at 0xb1740
(gdb) break splitInitMissingTypeError
Breakpoint 2 at 0x4a0209: file /home/mppf/w/10/compiler/passes/splitInit.cpp, line 456.
(gdb) r
Starting program: /home/mppf/w/10/bin/linux64-x86_64/chpl aa.chpl
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after vfork from child process 4824]

Breakpoint 2, splitInitMissingTypeError (sym=0x5555594cb4e0, mention=0x0, unresolved=false) at /home/mppf/w/10/compiler/passes/splitInit.cpp:456
456	  const char* name = toString(sym, false);
(gdb) print nprint_view(sym)
'unknown x[182721]:_splitInitType[49]'

$1 = void
(gdb) print stringLoc(sym)
$2 = 0x555555d590c0 <locBuff> "aa.chpl:3"

Thanks,

-michael

No matter when I stop it, it is always somewhere inside splitInitMissingTypeError. (I let it run for 15 hours in this case; this code usually takes 23 seconds to compile.) It might not be a while(true), but it's a cycle of some sort. Given that it didn't fault or gobble up RAM after 15 hours, I suspect it's not infinite recursion either.

chpl --gdb gives me the obscure error below. The stack trace is from gdb --args chpl.

error: The runtime has not been built for this configuration. Run $CHPL_HOME/util/chplenv/printchplbuilds.py for information on available runtimes.

Even though this is how I built the "debug build":

#!/bin/bash
CHPL_VERSION=1.29.0
CHPL_LLVM_OVERRIDE=system

#Get tarball
mkdir -p src_tarballs
if [ ! -f src_tarballs/chapel-$CHPL_VERSION.tar.gz ]; then
  wget -P src_tarballs https://github.com/chapel-lang/chapel/releases/download/$CHPL_VERSION/chapel-$CHPL_VERSION.tar.gz
fi

#Unpack it
if [ ! -d chapel-$CHPL_VERSION-src-$HOSTNAME ]; then
  tar -xzf src_tarballs/chapel-$CHPL_VERSION.tar.gz && mv chapel-$CHPL_VERSION chapel-$CHPL_VERSION-src-$HOSTNAME
fi

#set it up and build it
pushd chapel-$CHPL_VERSION-src-$HOSTNAME
source util/setchplenv.bash
export CHPL_CUDA_PATH=`pwd`/../cuda-$HOSTNAME
export CHPL_LOCALE_MODEL=gpu
export CHPL_LLVM=$CHPL_LLVM_OVERRIDE
export CC=clang
export CXX=clang++

#traditional build
make clean
#./configure --prefix=`pwd`/../../build_dirs/$HOSTNAME/chapel-$CHPL_VERSION
./configure --prefix=`pwd`/../chapel-$CHPL_VERSION-build-$HOSTNAME
make -j `nproc`
make install

#debug build
make clean
./configure --prefix=`pwd`/../chapel-$CHPL_VERSION-build.gdb-$HOSTNAME
make -j `nproc` DEBUG=1 OPTIMIZE=0
make install DEBUG=1 OPTIMIZE=0
popd

As I said, I'm still working on getting the bug to trigger in the partial reproducer, but I'm using a similar pattern, minus the ? on the cast back into Chapel types.

Allocator:

proc NewFooHandle(type foo_type : foo(?)) : foo_handle {
  var retFoo = new foo_type();
  retFoo.size = 8675309;
  writeln("Initialized size is: ", retFoo.size);
  var retCast = c_ptrTo(retFoo);
  writeln(retCast);
  var retHandle : foo_handle;
  retHandle.desc = new foo_desc(foo_type.isWeighted, foo_type.isVertexT64, foo_type.isEdgeT64, foo_type.isWeightT64);
  retHandle.data = (retCast : c_void_ptr);
  return retHandle;
}

User:

proc RecastFoo(type foo_type : foo(?), in handle : foo_handle) {
  assert(handle.desc.isWeighted == foo_type.isWeighted, "Provided foo_handle: ", handle :string, " incompatible with recast type: ", foo_type : string);
  var castData = handle.data : c_ptr(unmanaged foo_type); //Adding unmanaged to this line started hanging the compiler in the full code
  writeln(castData : string);
  var data = castData.deref() : foo_type;
  writeln("Read data size: ", data.size);
  writeln("Weighted: ", handle.desc.isWeighted);
}

As an aside, the deref without the preceding writeln prints garbage data, and it causes a runtime fault once the writeln is added, but I'll debug the pointer stuff after the reproducer triggers the loop.

Meeting time, more later

I think you need to use new unmanaged in this case; otherwise the class instance is allocated as owned and will be deallocated at the end of this function.

Here is what I would recommend (and here I'm assuming that foo_handle and foo_desc are some sort of extern record):

proc NewFooHandle(type foo_type : foo(?)) : foo_handle {
  var retFoo = new unmanaged foo_type();
  retFoo.size = 8675309;
  writeln("Initialized size is: ", retFoo.size);
  var retCast = retFoo : c_void_ptr;
  writeln(retCast);
  var retHandle : foo_handle;
  retHandle.desc = new foo_desc(foo_type.isWeighted, foo_type.isVertexT64, foo_type.isEdgeT64, foo_type.isWeightT64);
  retHandle.data = (retCast : c_void_ptr);
  return retHandle;
}

Originally I had unmanaged in the allocator, but I guess I took it back out when I was poking at this before the end of yesterday. Either way, the result is the same: the hang still triggers in the full code and still doesn't trigger in the reproducer (so far).

Yeah, it should give a segfault without it, right? That's what happened when we were still attempting records and expecting new to do a C++-style heap allocation (without auto-delete). Either way, the runtime fault is orthogonal to the compiler loop.

foo_handle and foo_desc are Chapel records that mirror records in the full code, but the reproducer doesn't reproduce (yet). Cart before horse.

Yes, agreed. And, I also agree it is some sort of infinite loop (I had not caught on to the 15 hour runtime in your 1st post).

Just looking at that loop in toString(VarSymbol, ...):

  Symbol* sym = var;
  // Compiler temporaries should have a single definition
  while (sym->hasFlag(FLAG_TEMP) && !sym->hasFlag(FLAG_USER_VARIABLE_NAME)) {
    SymExpr* singleDef = sym->getSingleDef();
    if (singleDef != NULL) {
      if (CallExpr* c = toCallExpr(singleDef->parentExpr)) {
        if (c->isPrimitive(PRIM_MOVE) ||
            c->isPrimitive(PRIM_ASSIGN)) {
          SymExpr* dstSe = toSymExpr(c->get(1));
          SymExpr* srcSe = toSymExpr(c->get(2));
          if (dstSe && srcSe && dstSe->symbol() == sym) {
            sym = singleDef->symbol();
            continue;
          }
        }
      }
    }

    // Give up
    sym = NULL;
    break;
  }

It seems like if we entered that loop in a state where sym->getSingleDef() was referring to the eventual dstSe (instead of srcSe, like I think it's expecting?), it would effectively keep executing sym = sym; continue; forever?
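To make the suspected failure mode concrete, here is a minimal C++ sketch. It is not the compiler's code: Sym, WalkResult, walkTemps, demo, and the maxSteps cap are all hypothetical stand-ins, and it assumes (as guessed above) that the symbol named by the single-definition SymExpr is the temporary itself, so following it never makes progress.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for a compiler symbol; NOT the real Symbol class.
struct Sym {
  const char* name;
  bool isTemp;        // models FLAG_TEMP && !FLAG_USER_VARIABLE_NAME
  const Sym* defSrc;  // source of the single move defining this symbol, if any
};

struct WalkResult {
  const Sym* sym;  // symbol reached, or nullptr if the walk gave up
  bool hitCap;     // true if maxSteps iterations passed without leaving the loop
};

// Same shape as the quoted loop. followSource selects which side of the
// defining move we step to: false = destination (sym itself), true = source.
WalkResult walkTemps(const Sym* start, bool followSource, int maxSteps) {
  const Sym* sym = start;
  for (int step = 0; sym != nullptr && sym->isTemp; ++step) {
    if (step == maxSteps) return {sym, true};  // the quoted loop has no such cap
    if (sym->defSrc != nullptr) {
      // Stepping to the destination is effectively `sym = sym; continue;`.
      sym = followSource ? sym->defSrc : sym;
      continue;
    }
    sym = nullptr;  // give up
  }
  return {sym, false};
}

// Small driver: a temporary `tmp` defined by a move from user variable `x`.
void demo() {
  Sym user{"x", false, nullptr};
  Sym tmp{"tmp", true, &user};

  WalkResult stuck = walkTemps(&tmp, /*followSource=*/false, 1000);
  WalkResult ok = walkTemps(&tmp, /*followSource=*/true, 1000);

  assert(stuck.hitCap);
  assert(!ok.hitCap && ok.sym == &user);
  std::printf("destination walk hit cap: %d, source walk reached: %s\n",
              stuck.hitCap ? 1 : 0, ok.sym->name);
}
```

Calling demo() from a main() shows the destination-following walk running until the cap, while the source-following walk terminates at the user variable. Whether stepping to the source is the right change in the real compiler is an assumption of this sketch, not something confirmed here.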

Does it change the behavior at all if you compile with chpl --verify? (For the code where the compiler hangs.)

Apologies for the delay getting back to y'all. Other fires to put out :)

--verify doesn't change anything.

The introduction of the unmanaged keyword appears to be a red herring. The bug actually seems to be triggered by a CallExpr to a function in another module that was using the record-now-class, but hadn't been reworked yet.

I suspect that, since the conversion from record to class, what was once a concrete call has now become generic w.r.t. the memory management of the Foo class that is provided as both type and formal argument.

I've managed to trigger it in a minimal reproducer. I'm "too new" of a user to attach files, but thankfully the code is short enough to copy/paste. Foo.m and FooUser.m compile without issue; it is only when repro_main.chpl is compiled that the compiler loops.

Foo.chpl:

module Foo {
  class foo {
    var size : int(64);
    param isWeighted : bool;
    var arrDom : domain(1) = {1..(if isWeighted then size else 0)};
  }
}

FooUser.chpl:

module FooUser {
  use Foo;
  proc Use_Foo(type inType : foo, in data : inType, type outType : real(?), outWeights : [] outType) {
  //Do stuff
  }
}

repro_main.chpl:

module Reproducer {
  use Foo;
  use FooUser;
  proc main() {
      var myFoo : foo(false);
      var retDom : domain(1) = {1..10};
      var retArr : [retDom] real(32);
      //There is only one CallExpr in Reproducer.main, so the loop seems to trigger on this
      Use_Foo(foo(false), myFoo, real(32), retArr);
  }
}

Makefile:

HOSTNAME=$(shell hostname)
CFLAGS:=$(CFLAGS)
LDFLAGS:=$(LDFLAGS)
CC=clang
CXX=clang++

CHPL_CUDA_PATH=../../tool-installs/cuda-$(HOSTNAME)
CHPL_BIN_PATH=../../tool-installs/chapel-latest.gdb-$(HOSTNAME)/bin

CHPL:= CHPL_CUDA_PATH=$(CHPL_CUDA_PATH) $(CHPL_BIN_PATH)/chpl
CHPL_FLAGS := $(CHPL_FLAGS) $(CHPL_GPUAPI_MODULES) 

#CHPL_FLAGS := $(CHPL_FLAGS) -g  --no-optimize --baseline
CHPL_FLAGS := $(CHPL_FLAGS) -g --devel --verify


.PHONY: all
all: reproducer

reproducer: Foo.m FooUser.m repro_main.chpl
	$(CHPL) repro_main.chpl -o reproducer $(CHPL_FLAGS)

Foo.m: Foo.chpl
	$(CHPL) Foo.chpl -o Foo.m $(CHPL_FLAGS)

FooUser.m: FooUser.chpl
	$(CHPL) FooUser.chpl -o FooUser.m $(CHPL_FLAGS)

.PHONY: clean
clean:
	rm *.m reproducer

Thanks for the reproducer. I have the infinite loop fixed but there is still something wrong with the error message. I'm working on it.

In the meantime, this is the error that the compiler should be giving for your reproducer:

Reproducer.chpl:4: In function 'main':
Reproducer.chpl:5: error: Cannot default-initialize a variable with generic type
Reproducer.chpl:5: note: 'myFoo' has generic type 'anymanaged foo(false)'
Reproducer.chpl:5: note: 'foo(false)' indicates a non-nilable class with any management
Reproducer.chpl:5: note: consider adding a management decorator such as 'owned', 'shared', 'borrowed', or 'unmanaged'

Hope that helps.

Happy to help!

Thanks for the "should have gotten" compiler error. It confirms my suspicion that the problem was the lack of a memory-management decorator, albeit on the variable declaration rather than the function call.

Feel free to close the issue at your discretion.

I've created Pull Request #21692 in chapel-lang/chapel ("Resolve an compiler hang when reporting a split init error") to fix this.