Performance Analytics for Computational Experiments (PACE)
Motivation
Understanding the computational performance of a complex coupled model like E3SM is a significant challenge for domain and computational scientists. To address it, researchers developed PACE (Performance Analytics for Computational Experiments), a framework that summarizes performance data collected from E3SM experiments, derives insights from it, and presents them through a web portal.
The PACE web portal is available at https://pace.ornl.gov/.
PACE already contains data for more than 5000 experiments for users to explore, and additional experiments are added regularly. Try it yourself!
The primary goal of PACE is to serve as a central hub of performance data to provide an executive summary of E3SM experiment performance.
PACE is designed to enable the following capabilities:
- Interactive analyses and deep dives into experiments and application sub-regions, as desired,
- Tracking performance benchmarks and simulation campaigns of interest,
- Facilitating performance research on load balancing and process layouts,
- Identification of bottlenecks to inform targeted optimization efforts.
Web Portal Features
The PACE portal provides a powerful search capability, including autocomplete, to select and dive deeply into experiments of interest. Users can search for experiments pertaining to a specific user or platform, for a specific model configuration (compset, grid, or case), or for a combination thereof. The “search results overview” shows basic experiment metadata such as user, machine, and case name. Upon selecting an experiment, the user is directed to the “experiment details view”, which includes the model throughput and process layout for the various components (see figures below).
A user can dive deeply into a particular task’s performance data through an interactive tree graph (see figure at left), which displays the “application timers” that log how long particular code blocks took to run. A user can also select a parallel process or thread for a more detailed view, which aids comparisons with other regions or sub-regions.
The flame graph below gives an alternative view of the data that highlights the time spent in different application regions.
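To make the relationship between the tree view and the flame graph view concrete, here is a minimal Python sketch. It is not PACE's code: the timer names, the "/"-separated path convention, and both helper functions are assumptions made for illustration. It builds a nested tree from flat timer totals and then derives "folded stack" entries, the semicolon-joined form that common flame graph tools consume, where each entry carries the time spent in a region excluding its child regions.

```python
def build_tree(totals):
    """Turn {"run/atm/physics": seconds} into a nested dict tree.

    The "/" separator and the timer names are hypothetical.
    """
    root = {"name": "root", "seconds": 0.0, "children": {}}
    for path, seconds in totals.items():
        node = root
        for part in path.split("/"):
            node = node["children"].setdefault(
                part, {"name": part, "seconds": 0.0, "children": {}}
            )
        node["seconds"] = seconds
    return root


def folded_stacks(node, prefix=()):
    """Yield (stack, self_seconds) pairs in depth-first order.

    Self time is a region's total minus the totals of its direct children,
    clamped at zero in case of timer overhead or rounding.
    """
    for child in node["children"].values():
        stack = prefix + (child["name"],)
        child_time = sum(c["seconds"] for c in child["children"].values())
        self_seconds = max(child["seconds"] - child_time, 0.0)
        yield ";".join(stack), self_seconds
        yield from folded_stacks(child, stack)


if __name__ == "__main__":
    # Made-up inclusive timings, in seconds.
    example = {
        "run": 10.0,
        "run/atm": 6.0,
        "run/atm/physics": 4.0,
        "run/ocn": 3.0,
    }
    for stack, seconds in folded_stacks(build_tree(example)):
        print(stack, seconds)
```

The tree keeps inclusive times, which suits a drill-down view, while the folded form keeps exclusive times so that the widths in a flame graph add up to the total run time.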
PACE also enables performance comparisons across multiple experiments and parallel processes. An interested user can download an experiment’s provenance and raw performance data for further analysis. A detailed demonstration of the portal features is captured in the PACE Web Portal Features video.
Methodology and Infrastructure
The E3SM model incorporates a lightweight performance tracking capability by default. This tracking/profiling capability works by bracketing application sections of interest with start and stop markers based on the General Purpose Timing Library (GPTL). Whenever an E3SM user runs an experiment on a supported DOE supercomputer, provenance and performance data is automatically collected and archived in project-wide locations. This aggregated performance data is periodically uploaded to a central server and made accessible to E3SM users and performance specialists through the PACE web portal.
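The start/stop marker idea can be illustrated with a small Python sketch. This is a conceptual stand-in only: it is not the GPTL API and not how E3SM is instrumented. The class name `RegionTimers`, its methods, and the region names are all hypothetical. It shows how paired markers around nested code sections accumulate a wall-clock total and a call count per region.

```python
import time
from contextlib import contextmanager


class RegionTimers:
    """Hypothetical start/stop region timers (illustration, not GPTL)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._stack = []   # names of the currently open regions
        self._open = {}    # full region path -> start time
        self.totals = {}   # full region path -> accumulated seconds
        self.counts = {}   # full region path -> number of completed calls

    def start(self, name):
        """Open a region nested inside whatever region is currently open."""
        self._stack.append(name)
        self._open["/".join(self._stack)] = self._clock()

    def stop(self, name):
        """Close the innermost region; markers must be properly nested."""
        if not self._stack or self._stack[-1] != name:
            raise ValueError(f"stop({name!r}) does not match the innermost open region")
        path = "/".join(self._stack)
        elapsed = self._clock() - self._open.pop(path)
        self.totals[path] = self.totals.get(path, 0.0) + elapsed
        self.counts[path] = self.counts.get(path, 0) + 1
        self._stack.pop()

    @contextmanager
    def region(self, name):
        """Convenience wrapper that pairs start and stop automatically."""
        self.start(name)
        try:
            yield
        finally:
            self.stop(name)


if __name__ == "__main__":
    timers = RegionTimers()
    with timers.region("run"):
        with timers.region("atm"):
            sum(range(100_000))  # placeholder workload
    for path, seconds in timers.totals.items():
        print(f"{path}: {seconds:.6f} s over {timers.counts[path]} call(s)")
```

Keying the totals by the full nesting path, rather than by the bare region name, is one simple way to preserve the parent/child structure that a hierarchical view of the timers needs.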
Experiment Upload
All members of the E3SM project can upload their experiments’ performance data through a streamlined process. The PACE server ingests the performance data from an uploaded experiment and stores the associated provenance and performance information in a database. The raw experiment data is parsed to populate database tables that support visual analytics and interactive performance exploration.
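As a rough illustration of the ingest step, the Python sketch below parses a small text summary of timers and populates two tables in an in-memory SQLite database. Everything specific here is an assumption for the sake of the example: the schema, the column names, the whitespace-separated "name calls seconds" input format, and the `ingest` function are invented and do not describe PACE's actual database or the real E3SM timing file layout.

```python
import sqlite3

# Hypothetical schema: one row per experiment, many timer rows per experiment.
SCHEMA = """
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    username TEXT,
    machine TEXT,
    case_name TEXT
);
CREATE TABLE timer (
    experiment_id INTEGER REFERENCES experiment(id),
    name TEXT,
    calls INTEGER,
    seconds REAL
);
"""


def ingest(conn, metadata, timing_text):
    """Store one experiment's metadata and timers; return the new experiment id.

    Assumes each non-comment line of timing_text is "name calls seconds".
    """
    cur = conn.execute(
        "INSERT INTO experiment (username, machine, case_name) VALUES (?, ?, ?)",
        (metadata["username"], metadata["machine"], metadata["case_name"]),
    )
    exp_id = cur.lastrowid
    rows = []
    for line in timing_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, calls, seconds = line.split()
        rows.append((exp_id, name, int(calls), float(seconds)))
    conn.executemany("INSERT INTO timer VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return exp_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Made-up metadata and timer values.
    sample = "# name calls seconds\nrun_loop 1 10.5\natm_run 48 6.25\n"
    exp_id = ingest(
        conn,
        {"username": "someuser", "machine": "somemachine", "case_name": "somecase"},
        sample,
    )
    query = "SELECT name, seconds FROM timer WHERE experiment_id = ? ORDER BY seconds DESC"
    for name, seconds in conn.execute(query, (exp_id,)):
        print(name, seconds)
```

Keeping metadata and timers in separate tables means that searches by user, machine, or case only touch the small metadata table, while detailed timer rows are fetched once an experiment has been selected.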
For instructions on how to upload experiments, see the Upload How-To page and the PACE Upload demonstration video.
References
- PACE portal: https://pace.ornl.gov/
- Reference page: PACE
- Videos:
Contact: Sarat Sreepathi, sarat@ornl.gov, https://sarats.com, Oak Ridge National Laboratory
In collaboration with students:
Zachary Mitchell, Pellissippi State Community College
Gaurab KC, University of Tennessee, Knoxville
Acknowledgments:
Special thanks to Patrick H. Worley for incorporating the timing infrastructure and performance archiving capabilities in E3SM which paved the way for PACE.
This research was supported as part of the E3SM project, funded by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research.
The students’ work was partially supported by an appointment to the Science Education and Workforce Development Programs at Oak Ridge National Laboratory, administered by ORISE through the U.S. Department of Energy Oak Ridge Institute for Science and Education.
This research used resources of the Compute and Data Environment for Science (CADES) at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Thanks to Aaron Donahue and Peter Caldwell for sharing their process layout and atmosphere sub-component timing scripts.