New Issue: [Feature Request]: improved HPCToolkit integration

27226, "mppf", "[Feature Request]: improved HPCToolkit integration", "2025-05-09T18:05:07Z"

Summary of Feature

This issue is about improving HPCToolkit + Chapel integration.

Current Status

It's my understanding that HPCToolkit can analyze Chapel programs, but it does not know how to hide the details of the qthreads scheduler, and it does not know how to recognize Chapel communication calls.

HPCToolkit usage overview

Because HPCToolkit is frequently run on a combination of desktop/laptop, head node, and compute node, using it consists of running several commands. They are separate commands because they might be run on different systems.

Supposing the program to analyze is called a.out:

  1. hpcrun -e CPUTIME -t ./a.out <args>
  • this requests sampling based on CPU time (letting hpcrun pick the frequency)
  • it will collect traces of call stacks in the process
  • it will create an hpctoolkit-a.out-measurements directory
  • it is normally run on a compute node
  2. hpcstruct hpctoolkit-a.out-measurements
  • analyzes the binaries to support later operations
  • shows which shared libraries are involved
  • modifies the hpctoolkit-a.out-measurements directory to add the results
  • normally run on the head node
  3. hpcprof hpctoolkit-a.out-measurements
  • computes the information necessary for interactive use
  • modifies the hpctoolkit-a.out-measurements directory to add the results
  4. hpcviewer hpctoolkit-a.out-measurements
  • displays the information gathered in the above steps in an interactive way

In some cases it may be necessary to remove the hpctoolkit-a.out-measurements directory and start again.

In hpcviewer's top view, each row is a separate thread; you can point at a color and see what the call stack is at that point. It can summarize the threads on a node together into a single row. The color bar on the left indicates which core each row corresponds to. You can use the depth selector to choose how deep into the call stacks to show, or ask it to show all leaf calls. Changing the depth changes the colors.

Next Steps

The first step is to enable Chapel to have a level of support for HPCToolkit similar to what OpenMP has. The purpose of doing this is to allow the call stack traces to hide implementation details within the qthreads scheduler (or, for that matter, to show a task as having a call trace that includes the location where the task was launched).

To work with OpenMP programs, HPCToolkit makes use of the OpenMP Tools interface which is described in https://www.openmp.org/wp-content/uploads/ompt-tr2.pdf and incorporated into later OpenMP standards such as https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf . The goal is to make a similar facility in Chapel.

Here we describe the key parts of the OpenMP Tools interface that HPCToolkit uses which we are seeking to replicate in a similar (but not identical) Chapel tools interface:

  • ompt_start_tool -- this is a weakly-linked function which can be overridden by HPCToolkit (I think even at program launch, with LD_PRELOAD or similar). During program startup, this function is called by the OpenMP implementation, which gives HPCToolkit the opportunity to register callbacks for other events.
    • the result of ompt_start_tool contains initializer and finalizer functions; these should be called by the runtime at the appropriate times (initialization once the runtime is ready to talk to the tool).
  • ompt_function_lookup_t / ompt_set_callback -- these provide a way for the tool (HPCToolkit) to look up interface functions by name and to request callbacks on particular events. Note that ompt_set_callback works with an enum describing the possible events that can have a callback. ompt_set_callback takes the callback as a generic function pointer, but each callback needs to have a specified signature (see the sketch after this list).
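
To make the shape of this interface concrete, here is a minimal tool-side sketch in C, assuming the OpenMP 5.x declarations in omp-tools.h; the particular callback registered (ompt_callback_thread_begin) is just an example, not the set HPCToolkit actually uses:

  #include <omp-tools.h>
  #include <stdio.h>

  /* Handle for ompt_set_callback, filled in during tool initialization. */
  static ompt_set_callback_t my_set_callback;

  /* Example callback: tag each new thread with a tool-chosen value, the way
     HPCToolkit tags threads/regions/tasks with its own numeric IDs. */
  static void on_thread_begin(ompt_thread_t thread_type, ompt_data_t *thread_data) {
    (void)thread_type;
    thread_data->value = 42;
  }

  static int my_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t *tool_data) {
    (void)initial_device_num; (void)tool_data;
    /* Look up interface entry points by name, then register callbacks by enum. */
    my_set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
    my_set_callback(ompt_callback_thread_begin, (ompt_callback_t)on_thread_begin);
    return 1;  /* nonzero keeps the tool active */
  }

  static void my_finalize(ompt_data_t *tool_data) {
    (void)tool_data;
    printf("tool finalized\n");
  }

  /* Overridable symbol the OpenMP runtime calls during startup; returning
     non-NULL hands the runtime the tool's initializer and finalizer. */
  ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                            const char *runtime_version) {
    (void)omp_version; (void)runtime_version;
    static ompt_start_tool_result_t result = {my_initialize, my_finalize, {0}};
    return &result;
  }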

When working with tasks, HPCToolkit needs to assign a numeric ID for each parallel region and each task. In Chapel, we can think of a parallel region as a coforall or cobegin (and perhaps a sync block?). When a task starts running, HPCToolkit similarly needs to assign a numeric ID to it. The running task ID (assigned by HPCToolkit) needs to be recoverable when HPCToolkit uses a signal handler to interrupt execution. So, these assigned IDs need to be stored in task-local storage.
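
One way to make the running task's ID recoverable from a signal handler is to keep a tool-data slot in the task descriptor and a thread-local pointer to the currently running task; the sketch below assumes that approach, and every chpl_* name in it is hypothetical rather than an existing runtime symbol:

  #include <stdint.h>
  #include <stddef.h>

  /* Mirrors OMPT's ompt_data_t: a per-task slot the tool can fill however it
     likes (e.g. HPCToolkit's numeric task ID). */
  typedef union chpl_tool_data_t {
    uint64_t value;
    void    *ptr;
  } chpl_tool_data_t;

  typedef struct chpl_task_desc_t {
    chpl_tool_data_t tool_data;   /* ID assigned by the tool */
    /* ... other per-task bookkeeping ... */
  } chpl_task_desc_t;

  /* Updated by the scheduler whenever it switches the task running on this
     thread; reading a single thread-local pointer is async-signal-safe. */
  static __thread chpl_task_desc_t *chpl_running_task = NULL;

  /* Query a sampling tool could call from its signal handler. */
  chpl_tool_data_t *chpl_task_get_tool_data(void) {
    return chpl_running_task ? &chpl_running_task->tool_data : NULL;
  }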

The ompt_get_task_info query (of type ompt_get_task_info_t) provides these pieces of information:

  • task_data -- identifies the task (by the value HPCToolkit assigned to it)
  • frame -- for removing qthreads scheduler frames (a sketch of this bookkeeping follows the list)
    • stack frame address where we enter the runtime, and stack frame where we
      leave the parallel region (or task?)
    • LLVM's implementation of OpenMP uses these macros for these (which should cause the C compiler to use a base frame pointer for the function calling them):
      #define OMPT_GET_RETURN_ADDRESS(level) __builtin_return_address(level)
      #define OMPT_GET_FRAME_ADDRESS(level) __builtin_frame_address(level)
      
    • note that a thread in its idle / scheduler loop won't have an "enter the runtime" frame, only an "exit the runtime" frame
    • see the picture at the end of https://www.openmp.org/wp-content/uploads/ompt-tr2.pdf
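
A rough sketch of what corresponding frame bookkeeping could look like on the Chapel side, recording an address at each boundary crossing; all chpl_* names are hypothetical, and in a real runtime the record would live with the task rather than in plain thread-local storage:

  #include <stddef.h>

  typedef struct chpl_frame_t {
    void *enter_runtime_frame;  /* recorded where task code calls into the runtime */
    void *exit_runtime_frame;   /* recorded where the runtime calls into task code */
  } chpl_frame_t;

  #define CHPL_GET_FRAME_ADDRESS(level) __builtin_frame_address(level)

  static __thread chpl_frame_t chpl_cur_frame;   /* per task in a real runtime */

  /* Runtime entry point reached from compiler-generated task code (e.g. when a
     task spawns or blocks); records where user frames stop and runtime frames
     begin on this side of the boundary. */
  void chpl_runtime_entry(void) {
    chpl_cur_frame.enter_runtime_frame = CHPL_GET_FRAME_ADDRESS(0);
    /* ... hand off to the qthreads scheduler ... */
  }

  /* Scheduler-side wrapper that invokes a task body; records the boundary on
     the other side so a tool can elide everything between the two addresses. */
  void chpl_scheduler_run_task(void (*body)(void *), void *arg) {
    chpl_cur_frame.exit_runtime_frame = CHPL_GET_FRAME_ADDRESS(0);
    body(arg);
    chpl_cur_frame.exit_runtime_frame = NULL;
  }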

HPCToolkit will need to register something like 6 callbacks (TODO: which ones?).
HPCToolkit will call 4 "where am I" inquiry functions (a sketch of possible Chapel analogues follows the list):

  • get_frame_address
  • thread_type_get
  • check_state -- determine the thread's state, e.g. whether it is in a barrier
  • get_task_frame(i)
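
For reference, hypothetical Chapel counterparts of those four inquiry functions might be declared along these lines (the names, types, and signatures below are invented for illustration only):

  /* Purely illustrative declarations for the four inquiry ("where am I")
     functions a Chapel tools interface could expose. */
  typedef struct chpl_frame_t chpl_frame_t;   /* as sketched earlier */

  typedef enum { chpl_thread_worker, chpl_thread_other } chpl_thread_type_t;
  typedef enum { chpl_state_work, chpl_state_idle, chpl_state_barrier } chpl_state_t;

  void              *chpl_tool_get_frame_address(int level);
  chpl_thread_type_t chpl_tool_thread_type_get(void);
  chpl_state_t       chpl_tool_check_state(void);          /* e.g., in a barrier? */
  chpl_frame_t      *chpl_tool_get_task_frame(int ancestor_level);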