NERSC Hackathon on Profiling on NVIDIA GPUs

  • November 25, 2024
  • Feature Story
  • In collaboration with the NERSC Science Acceleration Program (NESAP), EAMxx group members Noel Keen, Luca Bertagna, and Naser Mahfouz participated in the Open Hackathon Series held virtually at LBNL in August 2024. The group worked closely with Akshay Subramanian from NVIDIA. The group set out to profile the EAMxx model at several levels, from the application level, covering the entire atmosphere–land runtime, down to the subprocess level in the physics parameterizations and the dynamics solver. During the hackathon, the group sped up the EAMxx model by approximately 10% on NERSC’s Perlmutter GPU system.

    To make the EAMxx codebase easier to profile with available tools such as Nsight Systems and Nsight Compute, the team first prepared the code with labels for easier interpretation and with build-time and run-time configuration options. With these infrastructure changes in place, the team profiled the EAMxx runtime and visualized bottlenecks and bugs in the code (Fig. 1). Through profiling, the team identified existing bugs in the code and reported them to the wider development team. They also identified potentially unnecessary run-time debug checks and synchronization activity that reduce performance.
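
The labeling step described above can be illustrated with a minimal C++ sketch. It is not the EAMxx implementation: the class names `RangeRecorder` and `ScopedRange`, the function `run_step_demo`, and the labels `atm_step`, `dynamics`, and `physics` are all hypothetical. The recorder here only writes to an in-memory log so that the example runs without a GPU. In a build with NVTX available, `push` and `pop` would presumably forward to `nvtxRangePushA` and `nvtxRangePop`, so that the labels show up as named ranges on the Nsight Systems timeline.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a profiler backend. It records nested
// begin/end events in memory. With NVTX, push() and pop() would instead
// call nvtxRangePushA(name.c_str()) and nvtxRangePop().
class RangeRecorder {
public:
  void push(const std::string& name) {
    events_.push_back(std::string(2 * depth_, ' ') + "begin " + name);
    ++depth_;
  }
  void pop() {
    assert(depth_ > 0 && "pop() without a matching push()");
    --depth_;
    events_.push_back(std::string(2 * depth_, ' ') + "end");
  }
  const std::vector<std::string>& events() const { return events_; }

private:
  std::vector<std::string> events_;
  std::size_t depth_ = 0;  // current nesting level, used for indentation
};

// RAII guard: the range opens in the constructor and closes in the
// destructor, so every exit path from a scope closes its label.
class ScopedRange {
public:
  ScopedRange(RangeRecorder& rec, const std::string& name) : rec_(rec) {
    rec_.push(name);
  }
  ~ScopedRange() { rec_.pop(); }
  ScopedRange(const ScopedRange&) = delete;
  ScopedRange& operator=(const ScopedRange&) = delete;

private:
  RangeRecorder& rec_;
};

// Hypothetical model time step with one label per process.
std::vector<std::string> run_step_demo() {
  RangeRecorder rec;
  {
    ScopedRange step(rec, "atm_step");
    { ScopedRange dyn(rec, "dynamics"); /* dynamics solver work */ }
    { ScopedRange phys(rec, "physics"); /* physics parameterizations */ }
  }
  return rec.events();
}
```

Calling `run_step_demo()` returns six events in which the `dynamics` and `physics` ranges are nested inside `atm_step`. Tying each label to a scope is one possible design choice: it keeps begin and end calls balanced, and the guard could be compiled out behind a build-time option when profiling is not needed.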

     

    Figure 1. Visualization of bottlenecks in the code. In the ideal case (upper half), kernels are launched from the CPU (denoted by KL) and executed on the GPU (KE). In the current EAMxx code (lower half), this theoretically optimal flow is not achieved because of excessive kernel synchronization (KS), as well as memory allocation (m) and memory freeing (f) during the model run. Additionally, debug checks at the end of each process appear as small but frequent activity on the CPU and GPU.

    The group presented their findings to the entire EAMxx team after the hackathon. The hackathon's success reflected a balanced team and a flexible strategy, which together led to substantial improvements. Continued effort is needed to debug the code, audit synchronization strategies, and explore further enhancements for sustained performance improvements.
