Scaling Ultrahigh-Resolution E3SM Land Model for Leadership-Class Supercomputers
Summary
A recent development in the E3SM land model enables kilometer-scale simulations across North America (21.5 million gridcells) using up to 105,600 CPU cores on leadership-class supercomputers (Summit and Frontier). The work relies on a data-driven configuration within the Common Infrastructure for Modeling the Earth (CIME) and on tools such as KiloCraft to generate input data. The research demonstrates the model’s scalability on leadership-class systems, including the exascale Frontier, and strong I/O performance, with write throughput approaching 400 GB/s. These results establish the capability of the E3SM Land Model (ELM) to conduct ultrahigh-resolution simulations over extensive geographical areas.
Science
Led by researchers at Oak Ridge National Laboratory, along with colleagues from Lawrence Livermore National Laboratory, Argonne National Laboratory, and Saint Louis University, this research centers on scaling up an ultrahigh-resolution E3SM Land Model (uELM). The uELM leverages supercomputing capabilities to conduct ultrahigh-resolution simulations across large geographical areas. By porting and evaluating the uELM framework on Summit and Frontier, the researchers achieved outstanding scalability on Summit with up to 2,400 nodes and 105,600 cores, and on Frontier with 1,200 nodes and 76,800 cores. The input/output (I/O) performance, notably a write throughput of approximately 395 GB/s on Frontier, also illustrates the potential of advanced supercomputing environments to handle large-scale data efficiently. Overall, this study not only establishes uELM’s capability for high-resolution simulations over vast geographical domains but also lays a computational foundation for future Earth system modeling breakthroughs. The research was selected as a finalist of the CCGRID International Scalable Computing Challenge (SCALE 2025), which highlights and showcases real-world problem-solving using computing that scales.
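As a rough sanity check on the scale of these runs, the node and core counts quoted above can be turned into cores per node and an approximate number of gridcells per core. The sketch below is illustrative only. It assumes the full 21.5-million-gridcell domain is divided evenly across all cores of the largest run on each machine, which the text does not state, and the names `RUNS`, `cores_per_node`, and `gridcells_per_core` are hypothetical helpers rather than part of uELM.

```python
# Node and core counts as quoted in the text.
RUNS = {
    "Summit": {"nodes": 2_400, "cores": 105_600},
    "Frontier": {"nodes": 1_200, "cores": 76_800},
}

# North America domain size quoted in the Summary.
GRIDCELLS = 21_500_000


def cores_per_node(run):
    """CPU cores used per node for a given run."""
    return run["cores"] / run["nodes"]


def gridcells_per_core(run, gridcells=GRIDCELLS):
    """Average gridcells handled by each core.

    ASSUMPTION: the whole domain is split evenly over every core of the
    run. The real decomposition may differ.
    """
    return gridcells / run["cores"]


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(
            f"{name}: {cores_per_node(run):.0f} cores/node, "
            f"about {gridcells_per_core(run):.0f} gridcells/core"
        )
```

Under that even-split assumption, each core would be responsible for only a few hundred gridcells, which helps explain why I/O and initialization costs become prominent at the largest node counts.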
The left panel of Figure 1 depicts weak scalability on Frontier, with I/O time covering both history and restart outputs. The computation time (total time minus I/O time) is consistently around 120 seconds for all four cases. In the PnetCDF experiments, the I/O write bandwidth was only 12.2 GB/s, which degraded weak scalability on more than 100 nodes. In contrast, experiments with the Adaptable IO System (ADIOS) showed excellent scalability, achieving an I/O write bandwidth of 186.3 GB/s on 300 nodes.
Strong scaling experiments (right panel, Fig. 1) reveal distinct outcomes for PnetCDF and ADIOS. Using PnetCDF, the execution time on 600 nodes is 435.06 seconds. Scaling to 1,200 nodes reduces both computation and I/O time, although increased initialization time offsets these gains. Write bandwidth increases from 12.22 to 20.62 GB/s as the node count increases from 300 to 1,200. With ADIOS, uELM performs significantly better, completing execution on 600 nodes in 158.92 seconds and reducing I/O time to 59.45 seconds, with a write bandwidth of 345.5 GB/s. However, scaling to 1,200 nodes raises total runtime to 238.90 seconds, largely because of increased I/O time driven by longer read operations (44.34 seconds versus 29.68 seconds on 600 nodes) and poor scalability of land component initialization at higher node counts, even though average write bandwidth improves to 395 GB/s.
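The strong-scaling figures quoted above can be summarized with the standard speedup and parallel-efficiency definitions. The sketch below is a minimal illustration that uses only the timings reported in the text. The function and constant names are hypothetical, and the derived ratios are simple arithmetic on those timings, not results reported by the study.

```python
def speedup(t_base, t_scaled):
    """Speedup of the scaled run relative to the baseline run."""
    return t_base / t_scaled


def parallel_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Strong-scaling efficiency: speedup divided by the node-count ratio."""
    return speedup(t_base, t_scaled) / (n_scaled / n_base)


# Total execution times in seconds, as quoted in the text.
PNETCDF_600_TOTAL = 435.06
ADIOS_600_TOTAL = 158.92
ADIOS_600_IO = 59.45
ADIOS_1200_TOTAL = 238.90

# ADIOS on 600 nodes compared with PnetCDF on 600 nodes.
adios_vs_pnetcdf = speedup(PNETCDF_600_TOTAL, ADIOS_600_TOTAL)

# ADIOS strong scaling from 600 to 1,200 nodes. A speedup below 1 means
# the larger run is slower overall.
adios_speedup = speedup(ADIOS_600_TOTAL, ADIOS_1200_TOTAL)
adios_efficiency = parallel_efficiency(
    ADIOS_600_TOTAL, 600, ADIOS_1200_TOTAL, 1200
)

# Time on 600 nodes not spent in I/O (total minus I/O).
adios_600_non_io = ADIOS_600_TOTAL - ADIOS_600_IO

if __name__ == "__main__":
    print(f"ADIOS vs PnetCDF at 600 nodes: {adios_vs_pnetcdf:.2f}x faster")
    print(f"ADIOS speedup, 600 to 1,200 nodes: {adios_speedup:.2f}")
    print(f"ADIOS efficiency, 600 to 1,200 nodes: {adios_efficiency:.2f}")
    print(f"ADIOS non-I/O time at 600 nodes: {adios_600_non_io:.2f} s")
```

A speedup below 1 for the ADIOS case reflects the observation above that doubling the node count increases total runtime, since read and initialization costs grow faster than the computation shrinks.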
Impact
The research conducted on uELM demonstrates its significant potential for kilometer-scale simulations across extensive geographical areas, which holds profound implications for both scientific understanding and practical application. By providing detailed, accurate simulations of environmental phenomena at ultrahigh resolution, this research supports governments and organizations in making informed decisions that protect infrastructure, economies, and public health, thereby bolstering national security in an increasingly unpredictable world.
Publication
Wang, D., Wang, C., Cao, Q., Krishna, J., Wu, D., Zheng, W., Schwartz, P., Yuan, F., Mohror, K., & Thornton, P. (2025). Scaling Ultrahigh-Resolution E3SM Land Model for Leadership-Class Supercomputers, IEEE Symposium on Cluster, Cloud, and Internet Computing 2025, SCALE challenge finalist. DOI:10.1109/CCGridW65158.2025.00050
