Hi Chapel support team,
I’ve been receiving exactly the error described in this issue (Segmentation Fault when attempting to write to file without file permissions · Issue #10204 · chapel-lang/chapel · GitHub), and it seems to be caused by creating files using IO functions on remote locales. I’ve read the post about resolving the issue, but it’s not working for me. Would you please help me with this compiler bug?
Thank you,
Marjan
Hi Marjan —
I'm close to timing out for the day, but for tomorrow (whether it's me or someone else who picks this up), would you let us know:
- whether you're using a system with a shared file system that is visible by all locales, or one that has a local file system per node?
- whether it would be possible to create a small Chapel code reproducing the behavior you're seeing?
Thanks,
-Brad
Hi Brad,
Of course, thank you, I’ll answer now for tomorrow.
1- No, it’s not a shared FS. Each locale has its own file system, but this file system “may” be used by other users as well. Each locale is a computer shared with other users, but for my Chapel program, locales are totally independent from each other.
2- I can’t, since it doesn’t “always” happen, and it doesn’t happen on one specific locale. Each time I run the program I get totally different nodes, so I can’t track nodes. I have a while loop on each locale in which this text file is created repeatedly. For example, on the fifth creation, or the 20th, I see this error and the program crashes. Sometimes it doesn’t happen at all, and sometimes it just shows a GASNet error without the error message that points to that text file; but recently I’m seeing it more often. I’m wondering whether giving permission to the folder containing this file would fix the problem?
Thank you,
Marjan
Hi Marjan —
Can you cut and paste a snapshot of the specific error output that you're getting here?
Also, is your program compiled with --fast, and if so, could you try re-compiling it and re-running it (a) without --fast or (b) with --fast --checks to see whether you get a different behavior?
Thanks,
-Brad
Hi Brad,
Yes of course. I'd been receiving the following error yesterday.
*** Caught a fatal signal (proc 0): SIGSEGV(11)
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
Spawner: read() returned 0 (EOF)
*** Caught a signal (proc 8): SIGTERM(15)
*** Caught a signal (proc 1): SIGTERM(15)
*** Caught a signal (proc 6): SIGTERM(15)
*** Caught a signal (proc 4): SIGTERM(15)
*** Caught a signal (proc 2): SIGTERM(15)
*** Caught a signal (proc 5): SIGTERM(15)
*** Caught a signal (proc 7): SIGTERM(15)
*** Caught a signal (proc 3): SIGTERM(15)
*** FATAL ERROR (proc 5): in gasnetc_ofi_tx_poll() at /third-party/gasnet/gasnet-src/ofi-conduit/gasnet_ofi.c:1132: fi_cq_read for tx_poll failed with error: Transport endpoint is n>
NOTICE: Before reporting bugs, run with GASNET_BACKTRACE=1 in the environment to generate a backtrace.
NOTICE: We recommend linking the debug version of GASNet to assist you in resolving this application issue.
uncaught PermissionError: Permission denied (in open with path "/calibration.txt")
/project/def-wayang/masgari/ChapelMulti/ChapelProject/runIMWEBs.chpl:97: thrown here
/project/def-wayang/masgari/ChapelMulti/ChapelProject/runIMWEBs.chpl:97: uncaught here
/cvmfs/soft.computecanada.ca/gentoo/2020/bin/cp: cannot stat
srun: error: cdr1331: task 4: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=33743105.1
slurmstepd: error: *** STEP 33743105.1 ON cdr1056 CANCELLED AT 2022-05-16T16:05:23 ***
And today, it showed me:
/project/def-wayang/masgari/ChapelMulti/ChapelProject/ParameterSet.chpl:1042: error: Out of memory allocating "string copy data"
Spawner: read() returned 0 (EOF)
*** Caught a signal (proc 7): SIGTERM(15)
*** Caught a signal (proc 6): SIGTERM(15)
*** Caught a signal (proc 0): SIGTERM(15)
*** Caught a signal (proc 4): SIGTERM(15)
*** Caught a signal (proc 8): SIGTERM(15)
*** Caught a signal (proc 3): SIGTERM(15)
*** Caught a signal (proc 5): SIGTERM(15)
*** Caught a signal (proc 1): SIGTERM(15)
I don't know if they are related to each other.
Thank you,
Marjan
I have not compiled with the --fast flag, so I will do so and let you know the output.
The function -->
proc generateCalibrationTXT(){
  var GlobalPars: list(shared Parameter, parSafe=true) = CalibrationParSet;
  for par in CalibrationParSet{
    if (par.getParameterType() != "General"){
      GlobalPars.remove(par);
    }
  }
  var FileIsCreated = false;
  try {
    var full_path: string = "/".join(ApplicationDir,"calibration.txt");
    var myFile = open(full_path, iomode.cw);
    var WritingChannel = myFile.writer();
    var AddedPars = 0;
    for i in GlobalPars {
      var partext = i.getCalibrationText();
      WritingChannel.writeln(partext);
      AddedPars = AddedPars + 1;
    }
    WritingChannel.close();
    myFile.fsync();
    myFile.close();
    if (AddedPars == GlobalPars.size){
      FileIsCreated = true;
    }
  } catch {
    writeln("There is an Error in Creating the Calibration Text File!");
    FileIsCreated = false;
  }
  setCalTextFile(FileIsCreated);
}
If you're not using --fast then it will only make the output worse, so don't bother (essentially, by default, Chapel will have a number of safety checks turned on which can hurt performance, but make errors easier to find; if you hit seg faults, a good practice is to turn off --fast or turn on --checks in order to get that checking enabled again).
(However, once you're doing production / performance-oriented runs, you probably will want to compile with --fast to get better performance from Chapel).
The thing that stands out for me in your error is:
PermissionError: Permission denied (in open with path "/calibration.txt")
which suggests that you're trying to write to /, which most users don't have the permissions for (so the permissions error seems reasonable). Given the code you sent, I think the question is therefore what would cause ApplicationDir to be an empty string?
And the other question is why your try...catch wouldn't be catching this error. Which line is runIMWEBs.chpl:97?
Thanks,
-Brad
Thank you, Brad.
The ApplicationDir cannot be an empty string; this variable gets initialized when creating an object of the class containing this function. So, if it were an empty string, the program would crash at the stage of instantiating the class.
In fact, I added the try-catch block after receiving the error. But so far I have not managed to catch it.
runIMWEBs.chpl:97 --> var myFile = open(full_path, iomode.cw);
I don't know if it's because of the memory issues, but each time I run the "same" program, a new compiler error shows up, such as Out of memory allocating "string copy data" in the last run. So, sorry, I haven't been able to catch the error message so far.
Thanks,
Marjan
Hi Marjan —
Note that there's nothing about initializing a string field with an empty string (by which I mean "", not a C-style NULL char* pointer, which Chapel doesn't have an equivalent of) that would crash Chapel inherently (unless something in your object initialization code would crash in that case). To humor me, I'd be curious to have you put code like:
assert (ApplicationDir != "", "ApplicationDir is empty");
into your code just before the declaration of full_path to see what happens (or you could just put a debugging print of writeln("ApplicationDir is: '", ApplicationDir, "'"); in to see what it is on each invocation).
The out-of-memory errors are concerning, though since they're on string data, it could potentially be a case of trying to allocate a too-big string. What is the line of code where that error is being issued? It's too bad that error message doesn't print out the size that it's trying to allocate... I wonder if we could change it to get that information.
I'm also curious whether you're able to run this program in a single-locale mode on your desktop (say) and whether it works OK there. Generally speaking, if you can, that will probably be easier to debug than on a cluster using GASNet. And it seems likely that these errors would occur in either setting.
-Brad
Thank you for the suggestion, I will put this line of code in my program to get more info about the potential reasons causing this error.
Regarding the out-of-memory error: it's a setter method. I read one column of one row of a database and use this setter to initialize a variable; it should be one word. So, I think the database had probably crashed and given a funny output to Chapel (which again gets back to the potential memory problems). Thank you for letting me know the reasons this error can occur.
If I run the program on a single locale, it won't be the same program, since in my program Locale 0's responsibilities/data are totally different from the other locales'. So, to run on one locale, many parts of the program would have to be deleted or modified, and some errors would probably disappear while others appear.
Thank you for all help,
Marjan
Let us know if there's more that we can do to help. My main reaction so far is that I wouldn't assume that there's a bug in the compiler or language, but most likely in the Chapel program itself. That said, there are some oddities here that I can't explain (like why your try...catch isn't catching), or that are surprising to you (like why you're getting an empty string where you don't think you should). So if it were me, I'd try to decorate key parts of my code with assertions or debugging statements like the one proposed above to figure out "Why does this string seem to be empty?" and then "Was it empty when I created this object?" It very well could be that you're hitting a bug in Chapel, but we'd need more information in order to be able to act on it.
With respect to one vs. multiple locales... Since Chapel is multithreaded, is it not possible to use a cobegin or begin (say) to kick off a task to do the locale 0 job and another task to do the locale 1 job? If this is a major restructuring to your code, it may not be worth it, but if it's minor, it may save you a lot of time and pain by permitting you to debug locally and without GASNet in the mix.
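As a rough sketch of that idea (not a drop-in for your program): the two procedures below, coordinatorJob and workerJob, are hypothetical stand-ins for whatever your code does on locale 0 and on the other locales. The only point is that both roles can run as separate tasks on a single locale.

```chapel
// Hypothetical stand-ins for the real per-locale work.
proc coordinatorJob() {
  writeln("doing the work normally done on locale 0");
}

proc workerJob(id: int) {
  writeln("doing the work normally done on locale ", id);
}

// Each statement in the cobegin runs as its own task; the block
// waits for both tasks to finish before execution continues.
cobegin {
  coordinatorJob();
  workerJob(1);
}
```

Because the two tasks run concurrently, the order of the two output lines is not guaranteed.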
-Brad
Hi Brad,
After a lot of investigation, and literally putting a try-catch block around every single line of code, I figured out that errors such as
Spawner: read() returned 0 (EOF)
Spawner: read() returned -1 errno = 9(Bad file descriptor)
can "also" be caused by calling the copyFile() function from the FileSystem module.
In fact, when we look up these errors to learn their causes, we mostly find that there may be an issue with opening or closing files, not copying. By using the SystemError class in the catch block to get the error message, I learned that on my Locale0, right before creating the text file, I was calling copyFile() to make a copy of the file before changing it, and the "No such file or directory" error was showing itself in the two disguises above.
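A minimal sketch of that kind of check, assuming a Chapel release in which copyFile() is still provided by the FileSystem module (as in the version used in this thread). The two paths are hypothetical, and a plain catch-all clause is used here instead of the SystemError class so that the example does not depend on where that class lives in a given release.

```chapel
use FileSystem;

// Hypothetical paths, for illustration only.
const src = "calibration.txt";
const dst = "calibration.bak";

try {
  // Throws if, for example, src does not exist.
  copyFile(src, dst);
} catch e {
  // Print the underlying message instead of letting the error
  // surface later as a less specific launcher failure.
  writeln("copyFile failed: ", e.message());
}
```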
In addition, since I keep my Chapel program in two remote locations (and I connect using ssh in Visual Studio Code), I observed that in one copy of the program, the code for closing the file after writing to it was missing.
I had:
WritingChannel.close();
But I did not have:
myFile.close();
This was the reason for seeing the following error:
uncaught PermissionError: Permission denied (in open with path "/calibration.txt")
Thank you so much again for all help and support,
Marjan
Hi Marjan —
I'm glad you're sorting out the issues. If there are behaviors in these routines that you think should change or be clarified or generate better errors/warnings to help prevent others from stepping into similar holes as you did, that's feedback that I think we'd like to receive.
Have a good weekend,
-Brad