Long Term Roadmap

The E3SM Project Roadmap, showing the relative sequencing of major simulation campaigns, model version development, and machine deployment addressed in each project phase.

To achieve the Grand Challenge of actionable Earth system predictions to address the most critical scientific questions facing our nation and DOE, the E3SM project defined its strategy in a fifteen-year roadmap with four intersecting project elements:

  1. A series of projection and simulation experiments addressing scientific questions and mission needs;
  2. A well-documented, well-tested, and continuously improving system of model codes that comprise the E3SM Earth system model;
  3. The ability to make effective use of leading-edge (and “bleeding-edge”) computational facilities soon after their deployment at DOE national laboratories; and
  4. An infrastructure to support code development, hypothesis testing, simulation execution, and analysis of results.

The Figure depicts the E3SM Project Roadmap, showing the relationships among the first three major project elements: the simulations, the modeling system to perform those simulations, and the machines on which they will be executed. Unlike the other three elements, which have distinct but overlapping phases, the fourth element, the infrastructure, will evolve continuously as project needs dictate.

Element 1: Major Simulations

E3SM’s mission-relevant, grand challenge science questions and the simulations envisioned to answer those questions by 2027 drive its strategy. With extensive input from both DOE and external scientific experts, the project narrowed its focus at the outset to three critically important Earth-system science drivers that strongly influence, and are influenced by, the energy system: the water cycle, biogeochemistry, and the cryosphere system. These science drivers integrate the understanding of Earth system processes that is foundational to Earth system prediction. The envisioned grand challenge simulations of the coupled human-Earth system at ultra-high resolution are not yet possible with current model and computing capabilities. Nevertheless, the project has developed a set of achievable experiments that make major advances toward answering the grand challenge questions, using a v1 modeling system that runs on leadership computing architectures available to the project now. As in all research projects, the early results will be used to refine science questions and develop new testable hypotheses to be addressed with subsequent versions of the modeling system. As shown in the Figure, E3SM envisions simulation campaigns of four to five years each with successive versions of the modeling system. Every campaign will inform the next, and the project envisions four campaigns leading to the simulations that fully address the overarching science questions in approximately 10 years.

Element 2: Model Development

The core of the E3SM project is science-driven model development. Priorities are determined by the need for the fully coupled modeling system to accurately simulate the processes and overall behaviors of the Earth’s water cycle, biogeochemistry, and cryosphere. This element connects the science needs with computing power provided by the DOE Office of Science (DOE-SC). E3SM is the only Earth system modeling project to target DOE Leadership Computing Facilities (LCFs) as its primary computing architectures. The E3SM model development path currently envisions three additional development cycles for its modeling system over the next 10 years. The staged nature of model development is depicted in the Figure, which recognizes that multiple versions are undergoing different stages of development and testing at any given time. The project is committed to fully documenting and testing each major model release, including benchmarking runs for performance and control simulations for scientific evaluation. E3SM has completed its v1 modeling system, which is being used in the v1 experiment campaign, and the v1 model code, simulation configurations, and model output from an initial set of runs will be released in April 2018. The v2 experimental campaign, which will be described in subsequent sections, serves as the starting point for v2 development priorities. This version will bridge the current LCF architectures with the next-generation systems, capable of 100 × 10¹⁵ or more floating-point operations per second (100 PFLOPS), that are just now being configured. Scientifically, the initial v3 requirements are being informed by analysis of the early v1 simulations.

Element 3: Leadership Architectures

E3SM cannot overemphasize the interdependence of the E3SM project and DOE LCF resources. High-end scientific computing has been relatively evolutionary and predictable since the adoption twenty years ago of massively parallel, microprocessor-based, distributed-memory supercomputers. From the late 1980s through the late 1990s, these systems, such as the Cray T3E and IBM-SP, ultimately replaced the large, special-purpose shared-memory vector systems that were the mainstay of many scientific communities, including weather and Earth-system simulation. That disruptive decade required Earth system modelers to experiment with different architectures and programming models to sustain progress.

Analogously, high-end scientific computing has entered a new disruptive period. As microprocessors have become ever smaller, they are approaching limitations dictated by power consumption and heat generation, requiring new and experimental processor designs that are the building blocks for the next generations of high-end computers. Achieving grand challenge science goals necessitates that the E3SM project lead the Earth system modeling community in adapting to unprecedented changes in the computing landscape. These challenges include slower system clock speeds and increased software complexity driven by computing system heterogeneity. The hardware path to exascale is far from settled. Over the last year, for example, Intel has announced the end of the Xeon Phi architecture, which was the foundation for the now-cancelled 100 PFLOPS machine that was to be delivered to Argonne National Laboratory’s LCF (ALCF) in 2018 (https://itpeernetwork.intel.com/unleashing-high-performance-computing/).

Supporting current and next-generation DOE machines while also preparing for exascale architectures will continue to require a mix of short-term optimization and intermediate- or long-term research in algorithms and computational approaches. The 100+ PFLOPS “Summit” computer is now being installed at Oak Ridge National Laboratory’s LCF (OLCF), but there are still unknowns about the capabilities of the machine and uncertainties as to whether the Summit design and programming model will ultimately be the best for exascale computing.

The DOE Exascale Computing Project now expects two “exascale-capable” machines to be available in 2021, two years earlier than previously projected (https://exascaleproject.org/wp-content/uploads/2017/04/Messina-ECP-Presentation-HPC-User-Forum-2017-04-18.pdf). The ALCF machine will be based on a new Intel chip architecture that is still in the prototype phase. E3SM expects the OLCF system will achieve exascale performance through GPU acceleration. The goal is to ensure E3SM is ready to achieve grand challenge science goals on both of these systems.

Element 4: E3SM Infrastructure

The priority science drivers and resulting three-year experiments were used to define the functionality of the initial simulation system. Infrastructure design is based on the requirements to facilitate hypothesis-testing workflows (configuration, simulation, diagnostics, analysis). As mentioned above, the infrastructure element is continuously evolving and does not have the distinct phasing of the other three elements. The infrastructure will maintain a disciplined software engineering structure and build turnkey workflows to enable efficient code development, testing, simulation design, experiment execution, analysis of output, and distribution of results within and outside the project. Since the E3SM model system will be made available to users outside of the project, the infrastructure includes the documentation and software repositories expected of open-source software systems.

Project Phasing and its Relationship to Model Versions

Administratively, DOE has supported the E3SM project under DOE-BER’s Scientific Focus Area structure, which requires a full proposal and complete peer review every three to four years. Each period covered under a single proposal is referred to as a project “phase.” During each phase, the project develops and runs multiple versions of the E3SM system. For example, during Phase 1 (2014-2018), E3SM v1 was developed and simulations using both the v0 and v1 models were performed. Many model features developed during Phase 1 will be integrated and tested in the v2 model. E3SM Phase 2 (2018-2021) activities include completion of the v1 simulations, development of E3SM v2 and associated simulations driven by the v2 science questions, and development of the v3/v4 model to be used in major simulations in Phase 3/4 from 2021 to 2027.
