27226, "mppf", "[Feature Request]: improved HPCToolkit integration", "2025-05-09T18:05:07Z"
### Summary of Feature
This issue is about improving HPCToolkit + Chapel integration.
### Current Status
It's my understanding that HPCToolkit can analyze Chapel programs, but it does not know how to hide the details of the qthreads scheduler, and it does not know how to recognize Chapel communication calls.
### HPCToolkit usage overview
Because HPCToolkit is frequently run on a combination of desktop/laptop, head node, and compute node, using it consists of running several separate commands, each of which might be run on a different system.
Supposing the program to analyze is called `a.out`:

`hpcrun -e CPUTIME -t ./a.out <args>`
- requests sampling with CPU time (letting hpcrun pick the frequency)
- collects traces of call stacks in the process
- creates an `hpctoolkit-a.out-measurements` directory
- normally run on a compute node
`hpcstruct hpctoolkit-a.out-measurements`
- analyzes the binaries to support later operations
- shows which shared libraries are involved
- modifies the `hpctoolkit-a.out-measurements` directory to add the results
- normally run on the head node
`hpcprof hpctoolkit-a.out-measurements`
- computes the information necessary for interactive use
- modifies the `hpctoolkit-a.out-measurements` directory to add the results
`hpcviewer hpctoolkit-a.out-measurements`
- displays the information gathered in the above steps in an interactive way
In some cases it may be necessary to remove the analysis directory and start again.
In hpcviewer's top view, each row is a separate thread; you can point at a color and see what the call stack is. It can summarize the threads on a node together into a single row. The color bar on the left indicates which core each row corresponds to. You can use the depth selector to choose how deep in the call stacks to show, or ask to show all leaf calls. Changing the depth changes the colors.
### Next Steps
The first steps are to give Chapel a level of support for HPCToolkit similar to what OpenMP has. The purpose of doing this is to allow the call stack traces to hide implementation details within the qthreads scheduler (or, for that matter, to show a task as having a call trace that includes the location where the task was launched).
To work with OpenMP programs, HPCToolkit makes use of the OpenMP Tools interface (OMPT), which is described in https://www.openmp.org/wp-content/uploads/ompt-tr2.pdf and incorporated into later OpenMP standards such as https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf. The goal is to make a similar facility in Chapel.
Here we describe the key parts of the OpenMP Tools interface that HPCToolkit uses which we are seeking to replicate in a similar (but not identical) Chapel tools interface:
`ompt_start_tool`
- a weakly-linked function that can be overridden by HPCToolkit (I think even at program launch, e.g. with LD_PRELOAD or similar). During program launch, this function will be called by the OpenMP implementation, which allows HPCToolkit to register actions for other events.
- the result of `ompt_start_tool` has initializer and finalizer functions; these should be called by the runtime at the appropriate times (initialize once the runtime is ready to talk to the tool).

`ompt_function_lookup_t` / `ompt_set_callback`
- provides a way for the tool (HPCToolkit) to request callbacks on particular events by name. Note that `ompt_set_callback` works with an `enum` describing the possible events that can have a callback. `ompt_set_callback` will use a pointer argument such as `void*`, but each callback needs to have a specified signature (see the sketch after this list).
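To make this concrete, here is a minimal sketch in C of what a Chapel analogue could look like, modeled on `ompt_start_tool` and `ompt_set_callback`. All of the `chpl_*` names here (`chpl_start_tool`, `chpl_tool_result_t`, `chpl_set_callback_t`, `chpl_tool_event_t`, and the event names) are assumptions for illustration, not an existing or proposed API:

```c
/* Hypothetical sketch of a Chapel tools interface modeled on OMPT.
 * All chpl_* names are assumptions for illustration, not an existing API. */
#include <stddef.h>

typedef enum {                  /* events a tool could subscribe to */
  CHPL_TOOL_EVENT_TASK_CREATE,
  CHPL_TOOL_EVENT_TASK_BEGIN,
  CHPL_TOOL_EVENT_TASK_END
  /* ... */
} chpl_tool_event_t;

/* Generic carrier type for callbacks; each event's real callback has its
 * own concrete signature and is cast through this type, as in OMPT. */
typedef void (*chpl_tool_callback_t)(void);

/* Analogue of ompt_set_callback: register interest in one event by enum. */
typedef int (*chpl_set_callback_t)(chpl_tool_event_t ev,
                                   chpl_tool_callback_t cb);

/* Analogue of ompt_function_lookup_t: the runtime hands the tool a lookup
 * function from which it obtains entry points such as "set_callback". */
typedef void* (*chpl_function_lookup_t)(const char* name);

typedef struct {
  int  (*initialize)(chpl_function_lookup_t lookup, void* tool_data);
  void (*finalize)(void* tool_data);
  void* tool_data;
} chpl_tool_result_t;

/* Weak symbol: the Chapel runtime calls this at startup. A tool such as
 * HPCToolkit provides a strong definition (possibly via LD_PRELOAD) and
 * returns non-NULL to activate itself. */
__attribute__((weak))
chpl_tool_result_t* chpl_start_tool(unsigned int chpl_version) {
  (void)chpl_version;
  return NULL;  /* default: no tool attached */
}
```

The weak-symbol default mirrors OMPT's activation model: the runtime unconditionally calls `chpl_start_tool` at startup, and a tool activates itself simply by providing a strong definition that returns non-NULL.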
When working with tasks, HPCToolkit needs to assign a numeric ID to each parallel region and each task. In Chapel, we can think of a parallel region as a `coforall` or `cobegin` (and perhaps a `sync` block?). When a task starts running, HPCToolkit similarly needs to assign a numeric ID to it. The running task's ID (assigned by HPCToolkit) needs to be recoverable when HPCToolkit uses a signal handler to interrupt execution, so these assigned IDs need to be stored in task-local storage.
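As a rough illustration, the following C sketch shows one way the tool-assigned IDs could be kept readable from a signal handler. It uses `_Thread_local` as a stand-in for true task-local storage (a qthreads-based runtime would need per-task data, since tasks can migrate between threads); `tool_task_info_t`, `on_task_begin`, and `sampling_handler` are hypothetical names:

```c
/* Sketch: storing tool-assigned IDs so a sampling signal handler can
 * recover them. _Thread_local stands in for task-local storage here. */
#include <stdint.h>

typedef struct {
  uint64_t region_id;  /* ID the tool assigned to the coforall/cobegin */
  uint64_t task_id;    /* ID the tool assigned to this task */
} tool_task_info_t;

static _Thread_local tool_task_info_t current_task_info;

/* Called (via a registered callback) when a task starts running. */
void on_task_begin(uint64_t region_id, uint64_t task_id) {
  current_task_info.region_id = region_id;
  current_task_info.task_id   = task_id;
}

/* A profiler's SIGPROF handler can read the IDs async-signal-safely:
 * only plain loads of data owned by the interrupted task. */
void sampling_handler(int sig) {
  (void)sig;
  tool_task_info_t info = current_task_info;
  (void)info;  /* attribute the sample to info.region_id / info.task_id */
}
```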
`ompt_get_task_info_t` has these fields:
- task-data -- identifies the task (by the ID HPCToolkit assigned to it)
- frame -- for removing qthreads scheduler frames: the stack frame address where we enter the runtime, and the stack frame address where we leave the parallel region (or task?)

LLVM's implementation of OpenMP uses these macros for these (which should cause the C compiler to use a base frame pointer for the function calling them):

```c
#define OMPT_GET_RETURN_ADDRESS(level) __builtin_return_address(level)
#define OMPT_GET_FRAME_ADDRESS(level) __builtin_frame_address(level)
```

- note that a thread in its idle / scheduler loop won't have an "enter the runtime" frame, only an "exit the runtime" frame
- see the picture at the end of https://www.openmp.org/wp-content/uploads/ompt-tr2.pdf
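Below is a hedged sketch of how a runtime might record the "enter the runtime" frame at the user-code/runtime boundary, in the style of LLVM's macros above. `chpl_task_frame_t`, `chpl_enter_scheduler`, and `chpl_qthreads_schedule` are hypothetical names used only for illustration:

```c
/* Sketch: recording the frame addresses a tool uses to elide scheduler
 * frames when unwinding from a sample. Hypothetical names throughout. */
typedef struct {
  void* enter_runtime_frame;  /* frame where user code entered the runtime */
  void* exit_runtime_frame;   /* frame where the runtime re-enters user
                                 code; set analogously on that boundary */
} chpl_task_frame_t;

#define TOOL_GET_FRAME_ADDRESS(level) __builtin_frame_address(level)

static _Thread_local chpl_task_frame_t cur_frame;

/* Hypothetical scheduler entry point (stubbed so the sketch compiles). */
void chpl_qthreads_schedule(void) { /* run queued tasks */ }

/* Wrapper on the user-code -> runtime boundary: record our own frame so a
 * tool unwinding from a sample can stop here and hide deeper scheduler
 * frames. Must not be inlined, or the recorded frame would be the
 * caller's rather than this wrapper's. */
__attribute__((noinline))
void chpl_enter_scheduler(void) {
  cur_frame.enter_runtime_frame = TOOL_GET_FRAME_ADDRESS(0);
  chpl_qthreads_schedule();
  cur_frame.enter_runtime_frame = NULL;  /* an idle thread in its scheduler
                                            loop has no "enter" frame */
}
```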
HPCToolkit will need to register something like 6 callbacks (TODO: which ones?).

HPCToolkit will call 4 "where am I" functions:
- `get_frame_address`
- `thread_type_get`
- `check_state` -- e.g. get whether the thread is in a barrier
- `get_task_frame(i)`
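For concreteness, here is a hedged sketch of what Chapel analogues of these four queries might look like as C declarations. All names, enum values, and signatures are assumptions based on the roles described above, not HPCToolkit's actual interface:

```c
/* Hypothetical "where am I" queries a tool could call from its sample
 * handler; every name and signature here is an assumption. */
typedef enum { TOOL_THREAD_WORKER, TOOL_THREAD_SCHEDULER } tool_thread_type_t;
typedef enum {
  TOOL_STATE_WORKING,
  TOOL_STATE_IN_BARRIER,
  TOOL_STATE_IDLE
} tool_state_t;

void*              tool_get_frame_address(void); /* frame info for the
                                                    currently running task */
tool_thread_type_t tool_thread_type_get(void);   /* worker vs. scheduler
                                                    thread */
tool_state_t       tool_check_state(void);       /* e.g. in a barrier? idle? */
void*              tool_get_task_frame(int i);   /* frame info i levels up
                                                    the task ancestry */
```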