E3SM – A Decade of Progress: Then & Now
While we’ve been celebrating all year, October 2024 officially marked a decade of development in E3SM. The project has significantly progressed since its inception in October 2014.
Throughout the year we’ve been highlighting this important milestone with stories:
- E3SM – A Decade of Progress: A Timeline,
- E3SM – A Decade of Progress: In Numbers,
- E3SM – A Decade of Progress: Lessons Learned
and fun facts about the project.
Today, we will explore the evolution of E3SM, examining its past and present, focusing on three areas: computational resources, code review evolution, and organizational and leadership changes.
The project started as Accelerated Climate Modeling for Energy (ACME) before transitioning to its current name, Energy Exascale Earth System Model (E3SM), in April 2018, which is also when the first official version of the model, v1, was released.
The E3SM Project has now reached an exciting juncture. It is on the verge of achieving its original ten-year goal, articulated in 2014: build a world-class Earth system model for the US Department of Energy (DOE) mission on exascale computing platforms. Today the project has an atmosphere model (EAMxx) written in C++ and fully capable of running efficiently on GPU-based exascale supercomputers. Another computationally expensive component that is being rewritten in C++ and GPU-enabled is the Ocean Model for E3SM Global Applications (OMEGA), which will be the ocean component of E3SM version 4. E3SMv4, scheduled for release in 2027, will be the first exascale-capable coupled Earth system model.
Computational Resources
The development of a sophisticated climate model relies heavily on the availability of computational resources at every stage: design and implementation, evaluation and testing, and production simulations. The decadal goal of building an Earth system model for exascale emphasized the need for E3SM to target capability-class supercomputers at the DOE Leadership Computing Facilities: the Oak Ridge Leadership Computing Facility (OLCF) and the Argonne Leadership Computing Facility (ALCF). Access to these flagship supercomputers is awarded through very competitive computational allocation programs, namely INCITE and ALCC. Additionally, the National Energy Research Scientific Computing Center (NERSC) provides capacity-class computing, with allocations awarded through the ERCAP program. Application readiness efforts to prepare for upcoming machines are facilitated through early access programs such as NESAP, CAAR, and ESP. Over the past decade, the E3SM project has consistently secured significant computational allocations through these programs on the Edison, Titan, Mira, Cori, Summit, Perlmutter, Frontier, and Aurora supercomputers.
However, during the first few years the project had only limited access to the capability-class computational resources that are better suited for long-running climate simulations. This heavily impacted the speed of model development.
To overcome the bottleneck of computational allocations, the E3SM project, thanks to support from DOE BER, purchased its own computing resources to meet the project’s goals. There are now three clusters and one large storage system available for the project.
Anvil at Argonne National Laboratory (ANL) was purchased in 2016, doubling in size in 2017. Its 100% allocation to E3SM was critical to the success of the v1 simulation campaign.
Compy at Pacific Northwest National Laboratory (PNNL) became operational in 2019 and was purchased to support development and simulation of E3SM for work funded by several BER projects. The available time is split among three accounts, with 50% of its allocation going to the E3SM SFA project. This helped E3SM development immensely. (Check out the recent fun facts article about Compy’s name.)
Chrysalis at ANL became operational in 2021, in time to contribute heavily to the v2 simulation campaign. The vast majority of v2 simulations (and v3 simulations, ongoing) were done on this machine.
Quick specs of the dedicated machines, for comparison:
- Anvil
  - 240 nodes, 2×18 Intel Xeon Broadwell CPUs, 64 GB/node
  - 2.0M node hours per year
- Compy
  - 460 nodes, 2×20 Intel Xeon Skylake CPUs, 192 GB/node
  - 4.0M node hours total (2M for E3SM) per year
- Chrysalis
  - 512 nodes, 2×32 AMD EPYC 7532 CPUs, 256 GB/node
  - 4.4M node hours per year
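As a rough sanity check on these figures, the short Python sketch below compares each quoted annual budget with the theoretical ceiling of node count times hours in a year. It assumes a 365-day year with every node available around the clock, which is a simplification; the function and variable names are illustrative and not part of any E3SM tooling.

```python
# Back-of-the-envelope check of the quoted node-hour budgets.
# Assumption: ceiling = nodes x 8760 hours (365-day year, no downtime).
HOURS_PER_YEAR = 365 * 24

# Node counts and quoted budgets (millions of node hours) from the list above.
machines = {
    "Anvil": {"nodes": 240, "quoted_mnh": 2.0},
    "Compy": {"nodes": 460, "quoted_mnh": 4.0},
    "Chrysalis": {"nodes": 512, "quoted_mnh": 4.4},
}


def max_node_hours(nodes, hours=HOURS_PER_YEAR):
    """Upper bound on node hours if every node ran for the given hours."""
    return nodes * hours


def quoted_fraction(nodes, quoted_mnh):
    """Quoted budget as a fraction of the theoretical ceiling."""
    return quoted_mnh * 1e6 / max_node_hours(nodes)


for name, m in machines.items():
    ceiling_mnh = max_node_hours(m["nodes"]) / 1e6
    frac = quoted_fraction(m["nodes"], m["quoted_mnh"])
    print(f"{name}: ceiling {ceiling_mnh:.2f}M, "
          f"quoted {m['quoted_mnh']:.1f}M ({frac:.0%} of ceiling)")

# E3SM's 50% share of Compy, in millions of node hours.
compy_e3sm_share_mnh = 0.5 * machines["Compy"]["quoted_mnh"]
```

Under these assumptions the quoted budgets sit within a few percent of the ceiling for all three machines, which suggests the numbers describe nearly the full machine capacity over a year.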
Archival disk: E3SM added a dedicated 9 PB disk storage system maintained at Lawrence Livermore National Laboratory. This was a critical part of the data processing and publication workflow throughout the years. The storage is no longer covered by a maintenance license, and we are transitioning the data archives to NERSC and the publication workflow to Chrysalis/ALCF. Storage on the computational resources is yet another bottleneck for model development and simulations, and we have implemented scrubbers to delete old data and free up space as needed.
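To illustrate what a scrubber of this kind does, here is a minimal Python sketch. It is not the project's actual tool: the age-based policy, the use of file modification time, and the names `find_stale_files` and `scrub` are all assumptions made for the example.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return files under root whose modification time is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = Path(dirpath) / fname
            try:
                if path.lstat().st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return sorted(stale)


def scrub(root, max_age_days, dry_run=True, now=None):
    """Delete stale files and return the bytes freed.

    With dry_run=True (the default) nothing is deleted and the return
    value is the number of bytes that would be freed.
    """
    freed = 0
    for path in find_stale_files(root, max_age_days, now):
        size = path.lstat().st_size
        if not dry_run:
            path.unlink()
        freed += size
    return freed
```

A call such as `scrub("/path/to/scratch", max_age_days=90)` would only report the reclaimable space; passing `dry_run=False` performs the deletion. Defaulting to a dry run is a deliberate choice here, since an accidental purge of simulation output is costly to undo.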
Our Code Review Processes
When E3SM began in 2014, there was only an informal code review process, centered on reviews of GitHub pull requests that checked for obvious coding errors. We combined this with nightly testing of several cases on different machines and compilers. In 2019, we established the first version of a more formal Code Review Process (CRP1), which included requirements for a design document and for extensive testing with verification, validation, and performance assessment. We also kept expanding the number of cases and machine/compiler combinations in our testing.
In 2021 we decided to revisit the code integration policy. Several reasons motivated this, most importantly: (1) the expectation that new features might come into E3SM from other BER-funded projects, and (2) the fact that some of the delay in the E3SMv2 tuning effort was due to insufficient testing or feature evaluation during the integration process, where features were added to the code in the “off” state and were not adequately tested in the “on” state within the coupled system. In 2022 the second, improved version of the Code Review Process was formalized, and it is now in use. The biggest change is the added requirement for both component and coupled model testing of all features in the “on” state, even if they are ultimately integrated with an “off” setting.
E3SM Organization and the Leadership Team
Both the organization and the Leadership Team changed quite a bit over the decade.
At the beginning, the project was organized around the component groups (“Atmosphere”, “Land”, and “Ocean”), the “Coupled System” group, and the engineering groups (“Software Engineering/Coupler”, “Workflow”, and “Performance and Algorithms”); see Figure 1. Each group had two Group Leads from two distinct laboratories, which ensured shared leadership responsibilities and fostered collaboration across laboratories toward common objectives.
Phase 1 Leadership Team consisted of:
- Atmosphere Group Leads: Phil Rasch (PNNL), Shaocheng Xie (LLNL)
- Land Group Leads: Peter Thornton (ORNL), Bill Riley (LBNL), later joined by Kate Calvin (PNNL)
- Ocean/Ice Group Leads: Todd Ringler (LANL), Steve Price (LANL)
- Performance Group Leads: Phil Jones (LANL), Mark Taylor (SNL), replaced in 2015 by Pat Worley (ORNL)
- Software Engineering and Coupler Group Leads: Robert Jacob (ANL), Andy Salinger (SNL)
- Workflow Group Leads: Dean Williams (LLNL), Kate Evans (ORNL), replaced in fall of 2015 by Val Anantharaj (ORNL)
At the project’s inception, Bill Collins served as Chief Scientist and Hans Johanson as Chief Computational Scientist, both at LBNL.
In Phase 2 (see Figure 2), E3SM shifted its focus towards long-term goals, adopting a higher-risk research approach alongside a more straightforward yet still complex model development. Consequently, the project dedicated its efforts to two primary categories of subprojects:
- the “Core Activities”, encompassing completion of the v2 model and execution of the v2 simulation campaign:
  - Water Cycle – Christophe Golaz (LLNL), leader; Luke van Roekel (LANL), deputy
  - Cryosphere – Todd Ringler (LANL), leader; Wuyin Lin (BNL), deputy
  - Biogeochemical Cycles – Katherine Calvin (PNNL), leader; Susannah Burrows (PNNL), deputy
  - Infrastructure and Data Management – Robert Jacob (ANL), leader; Chengzhu Zhang (LLNL), deputy
  - Performance – Philip Jones (LANL), leader; Sarat Sreepathi (ORNL), deputy
- the “Next Generation Development (NGD) Activities”, focused on developing new science and computational capabilities for E3SM versions v3 and v4, with a path for integration into E3SM within 5 years (NGD Sub-Projects):
  - NGD Atmospheric Physics: Shaocheng Xie (LLNL)
  - NGD Land and Energy: Ben Bond-Lamberty (PNNL)
  - NGD Nonhydrostatic Atmosphere: Peter Caldwell (LLNL)
  - NGD Software and Algorithms: Andy Salinger (SNL)
  - NGD BISICLES: Dan Martin (LBNL)
  - NGD Coastal Waves: Phillip Wolfram (LANL)
In addition, we decided to restructure the leadership of the groups. Instead of two equal leaders, each group now had one leader and a deputy. This change aimed to streamline decision-making, as having two leaders occasionally made difficult choices harder to reach.
In Phase 3, the E3SM project’s functional organization was realigned to better address the Nation’s strategic challenges. The new focus emphasizes actionable projections, in support of the project’s long-term goal of producing actionable science simulations with a state-of-the-art Earth system model on exascale computing systems.
To align the project more explicitly with the goal of actionable modeling and projections of Earth system variability and change, the simulation campaign groups were renamed: the Water Cycle group is now “Water Cycle Changes and Impacts,” the BGC group is now “Human-Earth System Feedbacks,” and the Cryosphere group is now “Polar Processes, Sea-Level Rise, and Coastal Impacts” (see Figure 3).
Phase 3 Leadership is composed of:
Executive Committee
- Project PI: Dave Bader (LLNL)
- Chief Scientist: Ruby Leung (PNNL)
- Chief Computational Scientist: Mark Taylor (SNL)
- Chief Operating Officer and Project Engineer: Renata McCoy (LLNL)
Phase 3 Groups
- Coupled Model Group: Lead: Chris Golaz (LLNL), Deputy: Wuyin Lin (BNL)
- Water Cycle Changes and Impacts Group: Lead: Bryce Harrop (PNNL), Deputy: Claudia Tebaldi (PNNL)
- Human-Earth System Feedbacks Group: Lead: Ben Bond-Lamberty (PNNL), Deputy: Jennifer Holm (LBNL), Deputy: Nicole Jeffery (LANL), Deputy: Qing Zhu (LBNL)
- Polar Processes, Sea-Level Rise, and Coastal Impacts Group: Lead: Stephen Price (LANL), Deputy: Andrew Roberts (LANL)
- Infrastructure Group: Lead: Robert Jacob (ANL), Deputy: Chengzhu Zhang (LLNL), Deputy: Sarat Sreepathi (ORNL)
- Atmosphere Group: Lead: Shaocheng Xie (LLNL), Deputy: Susannah Burrows (PNNL)
- EAMxx Group: Lead: Peter Caldwell (LLNL), Deputy: Susannah Burrows (PNNL)
- Omega Group: Lead: Luke van Roekel (LANL), Deputy: Steven Brus (ANL), Deputy: Mark Petersen (LANL)
- Ice Group: Lead: Elizabeth Hunke (LANL), Deputy: Darin Comeau (LANL)
- Land Group: Lead: Peter Thornton (ORNL), Deputy: Gautam Bisht (PNNL)
Cross-Cutting Coordinators
- Machine Learning/Artificial Intelligence Coordinator: Lead: Andy Salinger (SNL)
- Performance Coordinator: Lead: Sarat Sreepathi (ORNL)
E3SM’s Program Manager at the Department of Energy has also changed over the years. Dorothy Koch was instrumental in bringing this project to life, guiding it through the successful release of version 1 of the E3SM model. She served as our first Program Manager until stepping away in 2019 to explore new opportunities. We are grateful for her contributions. Following her departure, Sally McFarlane generously stepped up to serve as Interim Program Manager, ensuring continuity and support during the transition. In 2020 the E3SM project was fortunate to welcome Xujing Davis, our current Program Manager. She continues to lead us with unwavering commitment and a clear vision for the future.
This article is part of the E3SM “Floating Points” Newsletter.