Performance Analytics for Computational Experiments (PACE)

  • November 13, 2019
  • Atmosphere sub-component timing for multiple experiments across the High-res Water Cycle simulations.


    Understanding the computational performance of a complex coupled model like E3SM poses a singular challenge to domain and computational scientists. To address this challenge, researchers developed PACE (Performance Analytics for Computational Experiments), a framework that summarizes performance data collected from E3SM experiments, derives insights, and presents them through a web portal.


    The PACE web portal is available at:

    PACE already contains data for more than 5000 experiments for users to explore, and additional experiments are added regularly. Try it yourself!

    Simulation year per day (SYPD) versus total number of processors for High-res Water Cycle simulation campaign on Theta, along with search results.
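SYPD, the throughput metric plotted in the figure above, measures how many years of simulated climate the model completes per wall-clock day. A minimal sketch of the computation (the function and example numbers here are illustrative, not taken from PACE):

```python
def sypd(simulated_days: float, wall_seconds: float) -> float:
    """Throughput in simulated years per wall-clock day."""
    simulated_years = simulated_days / 365.0
    wall_days = wall_seconds / 86400.0
    return simulated_years / wall_days

# Hypothetical run: 5 simulated years completed in 12 hours of wall time.
print(sypd(5 * 365.0, 12 * 3600.0))  # -> 10.0 SYPD
```

Plotting SYPD against total processor count, as in the figure, shows how throughput scales as more resources are added to a simulation campaign.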

    The primary goal of PACE is to serve as a central hub of performance data to provide an executive summary of E3SM experiment performance.


    PACE is designed to enable the following capabilities:

    • Interactive analyses and deep dives into experiments and application sub-regions, as desired,
    • Tracking performance benchmarks and simulation campaigns of interest,
    • Facilitating performance research on load balancing and process layouts,
    • Identifying bottlenecks to inform targeted optimization efforts.

    Web Portal Features

    The PACE portal provides a powerful search capability, including autocomplete, for selecting and diving deeply into experiments of interest. Users can search for experiments pertaining to a specific user or platform, search for a specific model configuration (compset, grid, or case), or combine these criteria. The “search results overview” shows basic experiment metadata such as user, machine, and case name. Upon selecting a desired experiment, the user is directed to the “experiment details view,” which includes the model throughput and process layout for the various components (see figures below).


    Example of a PACE “experiment details view” that shows the mapping of parallel processes to model components along with simulation time spent.

    A tabular form details model run time and throughput per component.


    A tree graph view of application performance. Users can click on a specific application timer to dive deeper into the hierarchy to see the timing breakdown in sub-regions of interest.

    A user can dive deeply into a particular task’s performance data through an interactive tree graph (see figure at left), which displays “application timers” that log how long certain code blocks took to run. Additionally, a user can select a parallel process or thread for a more detailed view, aiding comparisons with other regions or sub-regions.

    The flame graph displayed below illustrates an alternate view of the data that highlights the time spent in different application regions.


    A flame graph concisely summarizes the time spent in various application regions.

    PACE also enables performance comparisons across multiple experiments and parallel processes. An interested user can download an experiment’s provenance and raw performance data for further analysis. A detailed demonstration of the portal features is captured in the PACE Web Portal Features video.


    Methodology and Infrastructure

    The E3SM model incorporates a lightweight performance tracking capability by default. This tracking/profiling capability is provided by designating various application sections of interest using start and stop markers based on the General Purpose Timing Library (GPTL). Whenever an E3SM user runs an experiment on a supported DOE supercomputer, provenance and performance data are automatically collected and archived in project-wide locations. The aggregated performance data are periodically uploaded to a central server and made accessible to E3SM users and performance specialists through the PACE web portal.

    Experiment Upload

    All members of the E3SM project can upload their experiments’ performance data through a streamlined process. The PACE server ingests the performance data from an uploaded experiment and stores the associated provenance and performance data in a database. The raw experiment data is parsed to populate various database tables that facilitate visual analytics and interactive performance exploration.
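The parsing step might look something like the sketch below. The column names and CSV layout are assumptions for illustration, not PACE's actual file format or schema:

```python
import csv
import io

# Hypothetical raw timer output from an experiment (format assumed).
raw = """\
timer,walltime_s,calls
RUN_LOOP,120.0,240
ATM_RUN,80.0,240
"""

# Parse into typed records, ready to insert into database tables.
rows = list(csv.DictReader(io.StringIO(raw)))
for r in rows:
    r["walltime_s"] = float(r["walltime_s"])
    r["calls"] = int(r["calls"])

print(rows[0])
```

Normalizing raw text into typed rows up front is what lets the portal run interactive queries and comparisons without re-reading the original files.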

    For instructions on how to upload experiments, see the Upload How-To page and the PACE Upload demonstration video.



    Contact: Sarat Sreepathi, Oak Ridge National Laboratory

    In collaboration with students:
    Zachary Mitchell, Pellissippi State Community College
    Gaurab KC, University of Tennessee, Knoxville

    Special thanks to Patrick H. Worley for incorporating the timing infrastructure and performance archiving capabilities in E3SM, which paved the way for PACE.

    This research was supported as part of the E3SM project, funded by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research.
    The students’ work was partially supported by an appointment to the Science Education and Workforce Development Programs at Oak Ridge National Laboratory, administered by ORISE through the U.S. Department of Energy Oak Ridge Institute for Science and Education.
    This research used resources of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

    Thanks to Aaron Donahue and Peter Caldwell for sharing their process layout and atmosphere sub-component timing scripts.
