New Chrysalis Machine Coming
In August 2020, the E3SM project expanded its dedicated computational resources by purchasing a new system funded by the Department of Energy (DOE) Office of Biological and Environmental Research (BER).
The machine will be called Chrysalis, a name proposed by the E3SM Program Manager, Dr. Xujing Davis, and voted on by the leadership team. It reminds her of her recent experience raising Monarch butterflies. “Watching the process of how a caterpillar phased into a beautiful butterfly via the chrysalis is amazing; it made me think of our E3SM development/transformation from one phase to another and the incredible effort behind the scenes from everyone through the process. After systematically consuming leaves/energy, the chrysalis provides a protective housing for a caterpillar to undergo a remarkable, transformative process over time, culminating in its evolution into a magnificent butterfly. It is our hope and plan that this new machine act in similar fashion for the development of DOE’s E3SM model,” explained Dr. Davis.
Chrysalis will reside next to Anvil, the first E3SM dedicated machine, purchased in 2016, and will be hosted by Argonne National Laboratory’s Laboratory Computing Resource Center (LCRC). Installation started in mid-August; the early user period is expected to begin in the middle of October, with the cluster and storage available to all E3SM users at the end of October.
Motivation for the Machine
DOE is a leader in advanced computing architectures, with the Leadership Computing Facilities (LCFs, the Argonne Leadership Computing Facility and the Oak Ridge Leadership Computing Facility) and the National Energy Research Scientific Computing Center (NERSC) transitioning away from traditional CPUs toward systems that rely on GPU acceleration. These new architectures are ideal for many E3SM simulations, especially ultra-high-resolution or multi-scale modeling framework (MMF) approaches to cloud-resolving simulation. However, E3SM scientists learned early in the project that these advanced architectures are difficult to use for model development and for low-resolution, IPCC-type simulations, where overnight turnaround and good performance on code that has not yet been ported to GPUs are requirements. For these workloads, dedicated CPU resources are critical, and the project’s past procurements of Anvil (2016) and Compy (2018) have been extremely productive.
Chrysalis Hardware Specifications
- 512 nodes, dual-socket, 64 cores per node
- AMD Epyc 7502 processors (32 cores per CPU)
- 256 GB of 16-channel DDR4-3200 memory per node
- HDR200 interconnect
- 2 PB of disk space
- 4 PB of tape backup
One new element in Chrysalis is the CPU: for the first time, E3SM will be using AMD processors rather than Intel. Image source and further technical specifications: https://www.amd.com/system/files/documents/TIRIAS-White-Paper-AMD-Infinity-Architecture.pdf .
The 1.3 Petaflops (Peak) Cluster
Chrysalis, the 1.3-petaflops (peak) cluster, will be operated exclusively for E3SM efforts, with user access authorized by designated E3SM team members. The dedicated 3-petabyte E3SM data storage will be an integrated addition to Argonne’s Laboratory Computing Resource Center’s (LCRC’s) new DDN storage system, and the procurement also includes 4 petabytes of tape archive/backup space.
E3SM team members will have a familiar and comfortable user environment, backed by LCRC’s first-class user and system support. The software stack will be similar to that on Anvil, deploying CentOS Linux, the Slurm job manager, and a broad set of compilers and libraries. E3SM team members will be able to retain their existing home and project file access on the new system, with greatly expanded file space. The file systems are based on IBM’s fast and reliable Spectrum Scale software (formerly called GPFS), as tuned for LCRC’s high-performance systems.
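Because the stack mirrors Anvil’s, job submission on Chrysalis should feel familiar to current users. As a purely illustrative sketch, a minimal Slurm batch script for a multi-node run might resemble the following; the job name, partition, time limit, and executable path are hypothetical placeholders, not actual Chrysalis settings:

```shell
#!/bin/bash
# Hypothetical Slurm batch script for Chrysalis (illustration only).
# Partition name and executable path are assumptions, not real settings.
#SBATCH --job-name=e3sm_test
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=64   # matches 64 cores per dual-socket Epyc node
#SBATCH --time=02:00:00
#SBATCH --partition=compute    # assumed partition name

# Launch one MPI rank per core across the allocated nodes
srun ./e3sm.exe
```

A script like this would be submitted with `sbatch`, with `squeue` and `sacct` available to monitor the job, as on other Slurm-managed LCRC systems.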