Computational Research
E3SM's key computational goal is to develop an Earth System Model that can make effective use of upcoming DOE exascale computing platforms at the ALCF and OLCF Leadership Computing Facilities (LCFs), as well as the smaller systems at NERSC. The above figure shows the current machines at these centers together with our best estimate of their future roadmap.
At the beginning of the E3SM project in 2014, we were running the high-resolution E3SM v0 model on the 10+ PFLOPS OLCF Titan and ALCF Mira systems. As we developed v1, we targeted the future 100+ PFLOPS Summit and Aurora systems, while making use of the available Titan, Mira, and NERSC Edison systems. A strong push was made for v1 to make efficient use of the new systems that appeared after the start of the project, based on the Intel Xeon Phi (KNL) processor: NERSC Cori and ALCF Theta. Following the demise of this exascale pathway, we have modified our roadmap to target two new systems in Phase II, only one of which is fully known. The first, OLCF Summit, will be available in 2018. It will have 4600 nodes, each with 2 IBM Power9 CPUs and 6 NVIDIA Volta GPUs. The second system is NERSC9, scheduled to replace Cori in mid-2020. Details of NERSC9 are not yet available, but possibilities include a traditional Intel Xeon-based system, an early version of the Aurora 2021 architecture, or a GPU-accelerated system.
The pre-exascale systems (Summit and NERSC9) will represent a 10x improvement over the current LCF petascale systems (Titan and Mira). If we can make full use of this improvement, it will allow for an increased use of ensembles, a 2x increase in resolution, or local refinement with very high regional resolution. These systems will be followed by the first exascale systems, Aurora and Frontier, providing an additional 10x improvement in 2021.
Despite the uncertain exascale roadmap, one key hardware trend guides all of our development: good performance requires very high arithmetic intensity, or large amounts of work per node. As a result, we expect the new architectures to be well suited for ultra-high-resolution simulations, because doubling the horizontal resolution in the atmosphere or ocean increases the number of calculations per time step for that component by a factor of four. On the other hand, doubling the resolution also requires halving the model time step to maintain numerical stability, thereby halving throughput. While exascale systems will allow Earth system simulation to exploit parallelism to an unprecedented degree, their limited clock speed will likely prevent any substantial improvement in maximum obtainable throughput.
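The scaling argument above can be made concrete with a small Python sketch. This is only an illustration: the function names are hypothetical, and it assumes two horizontal dimensions, a time step that shrinks in proportion to the grid spacing, and ideal weak scaling (the extra work per step is spread over more nodes, so the wall-clock time per step does not change).

```python
def refinement_cost(refine, horizontal_dims=2):
    """Relative cost of refining the horizontal grid spacing by `refine`.

    Returns (work per step, steps per simulated year, total work),
    each relative to the unrefined grid. Hypothetical helper.
    """
    # Refining both horizontal directions multiplies the number of columns.
    work_per_step = refine ** horizontal_dims
    # Assumption: the stable time step shrinks in proportion to grid spacing.
    steps_per_year = refine
    return work_per_step, steps_per_year, work_per_step * steps_per_year


def relative_throughput(refine):
    """Relative simulated years per day after refinement.

    Assumption: ideal weak scaling, so wall-clock time per step is
    unchanged and only the larger number of steps slows the run.
    """
    return 1.0 / refine


work, steps, total = refinement_cost(2)
print(f"work per step: {work}x, steps: {steps}x, total work: {total}x")
print(f"relative throughput: {relative_throughput(2)}")
```

Under these assumptions, doubling the resolution costs roughly 8x the total work, which is in line with a 10x larger machine supporting about a 2x increase in resolution, while throughput is still halved.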
Performance strategy
Our performance strategy is designed to meet three key goals during Phase II development:
- High Throughput: Obtain close to five simulated years per day (SYPD) for the E3SM v2 model on Summit, to enable long control simulations at high resolution. Based on the work completed in Phase I, we expect Summit to be the first machine to achieve this goal, although the GPUs will provide little benefit.
- Balanced Throughput, High Efficiency: Exploiting the GPU acceleration capability of the Summit system will be the main focus of our E3SM v2 performance work. Because each Summit node contains many GPUs, moderate GPU speedup will lead to a 10x increase in the number of simulations we can perform with a typical allocation. This will transform the traditional high-resolution modeling approach of running a few one-off simulations into a true climate science simulation capability.
- Cloud-Resolving Atmosphere Coupled with Eddy-Resolving Ocean: The performance work for E3SM v2, combined with our nonhydrostatic atmospheric model, will lead to an E3SM v3 model capable of running at cloud-resolving resolutions in the atmosphere, coupled to the eddy-resolving ocean already used in the high-resolution E3SM v1 model. This capability will be prototyped on Summit and lead to a robust v3 capability on the first exascale machines.