Unable to compile `pure virtual method called`

Hi! I'm writing a simple finite volume one dimensional fluid dynamics solver in Chapel. I have written a basic framework for my code, which is not compiling. The error message doesn't seem helpful, and I would appreciate the experience of Chapel developers in the community to help me out with this.
Here is the link to my code (small with only three files): GitHub - dutta-alankar/1d-fluid-finite-volume: One dimensional finite volume solver in Chapel

On compiling with the following command, chpl ./main.chpl ./structures.chpl ./utilities.chpl --fast -o main, I get this error:

pure virtual method called
terminate called without an active exception
error: invoking driver compilation phase

Hi Alankar —

I'm seeing failures with your program as well, though I see different error messages depending on the details of how I compile. I think I have tracked the issue down to these field declarations:

        var indicesAll: domain(?);
        var indicesInner: domain(?);
        var indicesStag: domain(?);

where, if I change them to share a single declaration (which should be legal since they are all going to end up being the same type):

        var indicesAll, indicesInner, indicesStag: domain(?);

things work for me, in the sense that I get through the compilation. Note that I'm not saying that writing it this way should be required, merely that it's a way to work around what seems to be a bug in the compiler that's unfamiliar to me. Here is a small standalone test that exhibits similar behavior:

class C {
  var d: domain(?);
  var d2: domain(?);

  proc init() {
    this.d = {1..10};
    this.d2 = {1..20};
  }
}

var c = new C();
writeln(c);

Would you see if my workaround gets things back on track for you and let us know?

Thanks,
-Brad

Hi Brad!

Thanks a lot. This solves the issue for now and the compilation succeeds. However, could you briefly walk me through in how to catch issues like this and identify the problematic line?

Regards,
Alankar

Hi Alankar —

I'm glad to hear that that the workaround helped. I wish I could tell you I had a great or elegant approach to how I narrowed this down, but it's not a very pretty story… just some failed attempts followed by some process of elimination.

Usually, as a compiler developer, when hitting an unhelpful error like this, I run the compiler in debug mode (e.g., chpl <normal command line arguments> --lldb or --gdb) and then when I hit the error, I use my familiarity with the compiler internals to see what part of the user's code was being processed when we hit the erroneous behavior. In this case, though, that didn't point to anything obvious, or even particularly helpful, which is somewhat unusual.

Since it was modest in size, I also inspected your code to see whether anything stood out as being particularly surprising or risky, but it all looked very clean and normal, so that didn't get me anywhere.

From there, I compiled each of your code's leaf modules one-by-one to see if they would reproduce the issue on their own, injecting a little bit of code to exercise them. I started with utilities.chpl since it was small, added a sample call to linspace(), and that worked as expected.

Then I moved on to structures.chpl, inserted a call to create an instance of grid and got the same error, so that let me just focus on that module. So then I tried creating instances of wall and cell, and those seemed to work as expected. So next, I started just commenting out fields and initialization lines of elements of grid using a binary search type of approach until I found that it worked if none of the fields after indicesAll were declared or initialized.

Then I added the indicesInner field back in and tried a few different ways of initializing it, none of which changed the failing behavior. So then I decided to try declaring it differently, where my first thought was to combine it with the previous field since their types were the same. This was motivated in part by one of the errors I saw in some of my compiles that referred to the type _unknown_chpl, which can be used by the compiler to represent the ? token as in the domain(?) type. Since the first domain(?) field was working, that made me think "What if there wasn't a second domain(?) type expression. And that happened to work.

So, no major takeaways or best practices that will necessarily help with future issues, as they may manifest in completely different ways. I do find in my own Chapel development (or any language, really), that compiling often as I'm adding code and making snapshots of my code to git frequently is helpful to continue making forward progress incrementally. Then, if/when something goes off the rails, I can check what code I added or changed since the last working version and sometimes narrow down the problem in that way. That said, there's obviously a tension between taking the time to compile and/or snapshot after each little change vs. making good progress in writing code, so coming up with a granularity that works well can be user- or project-specific.

I've opened Duplicate fully generic `myType(?)` fields result in compiler errors · Issue #27411 · chapel-lang/chapel · GitHub to capture this issue.

One piece of feedback that I had on your code is that I believe you could simplify your compilation line if you wanted to. Specifically, I think that chpl main.chpl --fast should give you the same behavior that you're currently getting because:

  • main.chpl defines a procedure named main so will be considered the "main module" of your Chapel program, causing the generated executable to be named after it without needing to use the -o main flag
  • Since structures.chpl and utilities.chpl define modules with the same name as the file, the occurrences of use structures and import utilities in your code will cause the compiler to search for, find, and parse those files without needing to add them to the command-line manually, if they're all in the same directory (if they weren't in the same directory, you could use the -M flag to have the compiler look for modules in another directory).

Thanks for making us aware of this issuse—I'm feeling surprised that we haven't run into it before (at least, that I can recall)…
-Brad

1 Like

If the use of

module structures
{
   ....
}

in the file structures.chpl redundant?

1 Like

Hi Damian —

Arguably yes, in that all code defined in a file named foo.chpl whose contents are not fully enclosed in a module … { … } declaration is considered to be defined by a module represented by the file named foo. But it can be considered a reasonable practice to explicitly name the module in the source code as well (e.g., most Chapel internal and standard modules do this).

This can be particularly useful if, for example, you start making clones of the file to capture variants or bugs or things you're exploring so that you can compile in foo-debug.chpl or foo-experimental.chpl or foo-v1.chpl and have it still define module foo (such that files containing use foo; or import foo; would not need to change.

In such cases, the file would need to be named on the chpl command-line, as the compiler would not jump from a use foo; to the thought that maybe it should parse a file named foo-slug.chpl in hopes that it defines a module named foo. At times, we've discussed and prototyped a capability to search through arbitrary .chpl files to find a named module, but those experiments have never stuck (so far) and might be considered too heroic.

-Brad

1 Like

Got it. Not into heroic programming.

Hi Brad,

This is extraordinary. Thanks a lot for making such an in-depth analysis of the code and also I appreciate your recommendation. I'm just beginning to program in Chapel, so your tips are super helpful.
I don't want to take a lot of your time. One quick question, now when I run the code, I get it running perfectly in 1 locale. But on multiple locales some of the wall objects in the wall array don't get formed and remain nil. Can you help me with a suggestion of what it could be?

Hi Alankar —

Not tonight I'm afraid, as it's already quite late here, but maybe on another day. Just to make sure I'm understanding the question, are you referring to the walls_tot field of the grid? And is there any pattern to the ones that are vs. are not getting set to nil? And where in your code are you running into the fact that some are incorrectly nil?

If you haven't already, note that when using stencilDist-distributed arrays, explicit calls to the updateFluff() method must be used to update the halos/ghost cells/overlap regions of the distribution. So, for example, if we're each locales, and I own {1..50, 1..100} and you own {51..100, 1..100} and we set up a fluff width of 1, I can technically access {51, 1..100} locally because I've allocated an extra row to store it, but it will not be initialized to store the values that you truly own and have initialized in row 51 without a call to updateFluff(). Only with that call will I copy your values over to my locale. So if the nil values you were seeing are along boundaries like that and you are not calling updateFluff(), that could explain the behavior—I'm essentially reading the uninitialized values in my 51st row. A call to updateFluff() after all the walls have been initialized should fix that.

This touches on another comment I was going to pass along, which is whether some of your classes might be more efficient or simpler if changed to records in your code—particularly those that act as a "value" type of object, which (with a quick glance), it seemed like wall objects might be. The distinction between classes and records in Chapel is a bit subtle, but important. This StackOverflow Q&A is a good overview of some of the distinctions and reasons to choose one or the other: https://stackoverflow.com/questions/48331007/when-should-i-use-a-record-vs-a-class-in-chapel.

Note that in asking this question, I don't mean to imply that switching from classes to records would solve the nil problem and make your code start working. Well, it would in a way, in that records don't have nil values, so you wouldn't have problems accessing nils anymore. But it would essentially change the default-initialized class elements (nils) into default-initialized record values instead. So you'd still have values that were not logically correct and would still need to fix the underlying issue.

The reason I bring it up the question of classes vs. recrods is that if I copy your 51st row of classes into my local cache of them, I'll essentially only be copying the pointer values, and the objects themselves will continue to live on your locale. Thus, when I access them, I will need to do remote reads and writes to do so, which can be expensive. Whereas if they are records, I will make copies of those records from your locale to mine and then be able to access them directly, locally, which could be cheaper.

OTOH, if there's something about the values that requires them to be shared, singleton values that are shared across locales, then classes would definitely be the way to go.

Hope this is clear, but ask questions if not,
-Brad

1 Like

No hurries @bradcray
Please take your time.
I have put in a few helpful messages with the new commit.
Here is the code result on 1 locale.

adutt@freya01:1d-simple$ ./main -nl 1
Walls on locale 0:
i=1 -> pos=-2.0
i=2 -> pos=-1.0
i=3 -> pos=0.0
i=4 -> pos=1.0
i=5 -> pos=2.0
i=6 -> pos=3.0
i=7 -> pos=4.0
i=8 -> pos=5.0
i=9 -> pos=6.0
i=10 -> pos=7.0
cell centers: 
-1.5 -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 
left wall positions: 
-2.0 -1.0 0.0 1.0 2.0 3.0 4.0 5.0 6.0 
right wall positions: 
-1.0 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 

On two locales, I get

adutt@freya01:1d-simple$ ./main -nl 2
Walls on locale 0:
i=1 -> pos=-2.0
i=2 -> pos=-1.0
i=3 -> pos=0.0
i=4 -> nil!
i=5 -> nil!
i=6 -> pos=3.0
i=7 -> pos=4.0
i=8 -> pos=5.0
i=9 -> pos=6.0
i=10 -> pos=7.0
structures.chpl:96: error: assert failed - Problem: Wall 4 is nil
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[38924,1],1]
  Exit code:    1
--------------------------------------------------------------------------

Aha, I see what's going on now. I was on the right track with the updateFluff() thought last night, but there was one other issue I wasn't noticing.

First, to describe what you're seeing: Since you are using a serial loop to inspect the elements of your array, it is running completely on the current locale (locale 0) and using the following logic: "Do, I store this element locally? If so, access it and print it out. If not, communicate with the locale that owns it in order to get a copy and print it out." Since your fluff value is 2, there are two extra elements that are local to locale 0, but for which it doesn't own the main value, and similarly, two that are local to locale 1, but for which it doesn't own the main value. So my first thought was that you simply weren't calling updateFluff(). But then, checking your code, you obviously are.

That deepened the mystery slightly, but with some investigation and thought, here's what's going on: For an owned class object, only one variable referring to the object can own it at any given time. One of the impacts of this is that if an owned class variable is assigned to another, the original is set to nil to avoid duplicate ownership. So what's going on in this code is that when updateFluff() is running, it's not just updating our local fluff values to point to the original class object (as I was describing yesterday), but actually transferring ownership of the object from the original locale to our fluff value, setting the original value to nil.

This means everything is working as expected, though there are some changes you could make to resolve the issue. Before getting there, though, here's a standalone example that demonstrates this:

use StencilDist;

config const update = true;

class C {
  var x: int;
}

var D = {1..10} dmapped new stencilDist({1..10}, fluff=(2,));
var A: [D] owned C?;

forall i in D do
  A[i] = new C(i);

// If A's element type is owned, when we update, we transfer the class
// ownership from one locale to the other, making the original 'nil';
// if it's shared, then we copy the reference.
//
if update then A.updateFluff();

writeln(A);

for i in D do
  writeln(A[i]);

When this is run with --update=false (which does not call updateFluff()), I get the following output:

{x = 1} {x = 2} {x = 3} {x = 4} {x = 5} {x = 6} {x = 7} {x = 8} {x = 9} {x = 10}
{x = 1}
{x = 2}
{x = 3}
{x = 4}
{x = 5}
nil
nil
{x = 8}
{x = 9}
{x = 10}

The first print shows no nil values because when I print the entire array, we'll have each locale print out its own values, such that none of the fluff values are printed. However, in the case of the loop, locale 0 is printing out everything, so when we get to elements 6 and 7 we see nil because it owns local copies of those elements, but since updateFluff() hasn't been called, they've never been updated from their default values.

So, when we run with --update=true, we get:

{x = 1} {x = 2} {x = 3} nil nil nil nil {x = 8} {x = 9} {x = 10}
{x = 1}
{x = 2}
{x = 3}
nil
nil
{x = 6}
{x = 7}
{x = 8}
{x = 9}
{x = 10}

Here, what's happening is that though we are the original/logical owner of elements 1-5, because we've made copies of our values into locale 1's fluff, we've transferred the ownership of those classes, which has caused ours to be nil.

Here are some things you could do to remedy this situation:

  • rather than using owned C? as your element type, you could use shared C?, which would permit multiple variables to refer to the object without setting any of them to nil. This makes my test program behave as expected once updateFluff() is called, and I suspect would for your code as well (though I didn't try it). It does have the downside mentioned last night that the original object will continue to live on locale 1 (say), so we'll refer to it remotely whenever we access it from locale 1 (say).

  • you could change your class types stored in the stencil array from classes to records. Since record variables are the object where class variables refer to an object, this would cause the fluff values to be copied locally such that each locale would have a copy of the object whenever you call updateFluff(). As mentioned last night, this could also have performance benefits in that any access to the fluff values would be local (which is typically the goal when storing cached values like this in the Stencil distribution).

  • you could switch from the stencil distribution to the block distribution, which would eliminate these fluff values, the need to update them, and any confusion over ownership. Normally, for a code doing stencils, this wouldn't be recommended for performance reasons; but if your array values are classes anyway, such that accessing those fluff values still requires going across the network, it may not be that big of a loss while simplifying the code. In any case, it may make sense to develop your code using the block distribution first before switching to the stencil distribution to get things working before working on performance (which would be the main incentive for using the stencil distribution).

Of these, I'd probably recommend using the record-based implementation of cell and wall, assuming that there aren't other reasons that they make more sense to represent as classes. If there are, then for me it's a toss-up at this point between using the block distribution or the shared array elements.

Hope this helps. Give a shout if anything is unclear.
-Brad

1 Like

@bradcray Thanks a lot for taking the time to help me out. This is incredibly helpful. I was completely unaware of the business of transfer of ownership carried out by updateFluff()

Indeed it seems like that for the purpose of my program, it is best to replace classes with records. Now, I have tried doing that in the dev branch of my repo. However, I encounter a compilation error which is unhelpful as it doesn't refer to any line in my code that is generating the problem. I again need your help in fixing this. Here's the error that I encounter.

adutt@freya01:1d-simple$ chpl ./main.chpl --fast -o main
$CHPL_HOME/modules/internal/ChapelArray.chpl:1016: In method 'this':
$CHPL_HOME/modules/internal/ChapelArray.chpl:1018: error: Scoped variable cannot be returned
$CHPL_HOME/modules/internal/DefaultRectangular.chpl:1380: In method 'dsiAccess':
$CHPL_HOME/modules/internal/DefaultRectangular.chpl:1382: error: Scoped variable cannot be returned
$CHPL_HOME/modules/internal/ChapelArray.chpl:1016: In method 'this':
$CHPL_HOME/modules/internal/ChapelArray.chpl:1018: error: Scoped variable cannot be returned
$CHPL_HOME/modules/internal/DefaultRectangular.chpl:1380: In method 'dsiAccess':
$CHPL_HOME/modules/internal/DefaultRectangular.chpl:1382: error: Scoped variable cannot be returned
$CHPL_HOME/modules/dists/StencilDist.chpl:1650: In method 'dsiAccess':
$CHPL_HOME/modules/dists/StencilDist.chpl:1652: error: Scoped variable cannot be returned
$CHPL_HOME/modules/dists/StencilDist.chpl:1650: In method 'dsiAccess':
$CHPL_HOME/modules/dists/StencilDist.chpl:1652: error: Scoped variable cannot be returned

Hi @dutta-alankar

We can definitely take a look. I've seen this error myself a few times recently when trying to return a local variable incorrectly, but only when I've been making breaking changes in the compiler itself. In my cases, though, I similarly felt that the error message could be more helpful.

Is this using the code in the dev branch of your repository? If so, there are some weird things going on here which I haven't figured out enough to be able to explain. Specifically, your code seems fine to me, and doesn't seem inherently different than other Chapel programs I've seen or written, yet it is generating behavior that seems very odd to me. It almost seems as though the compiler is trying to create cell objects of its own before running your code, but with a quick look, I'm not seeing any reason that it would be doing so.

I'm about to time out for the day, and am very unsatisfied with my understanding of the issues, so think this deserves a better look next week.

Through trial and error, I found that if I made the following two changes, your code seemed to compile and run:

  • I changed the declaration of state_domain to simply be const state_domain = {1..5};
  • In main, before allocating a grid object, I allocated a wall (e.g., as the first line within main): var w = new wall();

I don't believe either of these changes ought to be necessary, though, so it seems like we're hitting some buggy behavior that is not immediately familiar to me.

Would you see if one or both of these workarounds permits you to make progress here, and again, we'll look into this more after the weekend.

One other note I should've said earlier (that's unrelated to this behavior), is that I'd strongly encourage developing without using --fast, as doing so will turn on checks for common issues (like out-of-bounds array accesses). So typically, --fast should only be used when a program has been debugged and is working correctly and better performance is desired. That said, the use or lack of use of --fast does not seem to affect the behavior I'm seeing here.

-Brad

1 Like

@bradcray Thanks a lot for the help! I confirm that I'm referring to the dev branch of the repo. From now on, I'll use --fast only for production runs. There were two problems that I encountered.

  • I could not directly use state_domain. I don't know why but it gives me the following compile error
structures.chpl:5: error: module-scope variable 'state_domain' may be used before it is initialized
structures.chpl:5: note: 'state_domain' declared and initialized here
structures.chpl:5: note: calls function 'init'
structures.chpl:41: note: calls function 'init'
structures.chpl:11: note: mentions module-scope variable 'state_domain' not initialized yet

So as a workaround, I created a dummy domain variable inside the record called states_count and made it equal to state_domain. I do not have any understanding of why this was needed.

On a side note, I also had to have two different init proc. Otherwise var wall_left: wall; was generating a unresolved call 'wall.init()' compile error which I perhaps understand as it needs a default init with no arguments (Is it correct to interpret the error like this? I thought that records always come with a default initializer and I didn't have to create one manually). This is the error.

this candidate did not match: wall.init(position: real(64))
  • As you suggested adding the dummy line var w = new wal(); in main.chpl before grid creation indeed caused the compilation to go through successfully (even var w: wall; also works). This entire cure seems quite of a magic to me, and I have no understanding of why this is needed and what was going wrong in the earlier code.

Anyways, I'm committing the new edits to the dev branch. I look forward to your tests. Good news is that now switching to records from classes, things work as I intended them to. Your help would be incredibly helpful for me to build this Chapel application, which I believe can potentially evolve to be a very useful HPC tool -- simple yet powerful fluid dynamic solver.

1 Like

Hi Brad,

Thinking about how to avoid this issue for other users in the future,
should we generate an error or warning when stencilDist is used to store
owned values? It seems like it'd be unpredictable for users when such
values would be stolen by other locales, and I'd imagine the more common
case would be users doing something like this accidentally and running
into problems.

Lydia

@lydia : Good point. I think a warning for this case would make a lot of sense. Maybe something along the lines of "Because owned classes are transferred upon copy, using an owned array with the stencilDist distribution is unlikely to have the desired effect."

I could imagine cases in which such arrays might make sense, but I think they're enough of a stretch not to worry about until we have a user hit it and get annoyed by the warning.

-Brad

Hi @bradcray
Any ideas on why this seemingly redundant line fixed the issue?

Hi @dutta-alankar

No, I'm afraid I haven't had a chance to look back into this yet this week due to competing deadlines and priorities, though I still hope to (or to convince someone else to take a look).

-Brad

@bradcray no worries! I’m not in a hurry and keep the dev branch untouched. I’ll look forward to the result of someone more experienced looking into the code.. I agree with you that there’s a few unexpected behaviors as I mentioned earlier and knowing their cause will be definitely helpful.

Agreed. And note that I'm fairly certain that these would accurately be described as bugs in the implementation rather than issues in your code. I'm mostly very surprised that we haven't hit them before (to the best of my recollection) as what you're doing does not seem terribly unique or unusual compared to other programs.

No need to preserve the branch in its current state for my sake. As long as you don't force-push in a way that overwrites its history, I can always go back to the SHA in question if it continues to evolve.

-Brad

1 Like