E3SM – A Decade of Progress: Lessons Learned

  • August 29, 2024

    Ten years of E3SM development have provided ample time to learn important lessons. We asked the group leads to share the most valuable ones.

    Science Simulation Groups

    • Include “sanity checks” such as balance checks or in-bounds checks in the software. This is crucial to identifying problems early.
    • Be conservative in planning and willing to drop or delay features if they threaten to delay the overall science goals.
    • “Operationalize” the workflow for production simulations. For example, each time a new configuration or run script was created, the team reviewed a checklist to avoid common human errors. This caught many small mistakes early and saved time and compute resources.
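
    As an illustration of the “sanity checks” mentioned above, the following is a minimal Python sketch of an in-bounds check and a balance check. It is not E3SM code: the function names, variable names, bounds, and tolerance are all hypothetical and chosen only to show the idea of failing early when a field leaves its plausible range or a conserved quantity does not close.

```python
import math


def check_in_bounds(name, values, lower, upper):
    """Raise ValueError if any value is non-finite or outside [lower, upper]."""
    for i, v in enumerate(values):
        if not math.isfinite(v) or v < lower or v > upper:
            raise ValueError(f"{name}[{i}] = {v} is outside [{lower}, {upper}]")


def check_balance(name, total_before, total_after, net_source, rel_tol=1e-10):
    """Raise ValueError if the change in a conserved total does not match
    the net source over the step; return the residual otherwise."""
    residual = (total_after - total_before) - net_source
    # Scale the tolerance by the size of the totals (hypothetical choice).
    scale = max(abs(total_before), abs(total_after), 1.0)
    if abs(residual) > rel_tol * scale:
        raise ValueError(f"{name} balance residual {residual:.3e} exceeds tolerance")
    return residual


# Hypothetical usage: values and bounds are made up for illustration.
check_in_bounds("surface_temperature_K", [271.3, 288.1, 301.7], 150.0, 350.0)
check_balance("water_mass_kg", 1000.0, 1002.5, 2.5)
```

    Calling checks like these at every step, or at a coarser interval if cost is a concern, turns a silent drift or an out-of-range value into an immediate, named error rather than a problem discovered much later in a long simulation.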

    Infrastructure

    • Use version control and open development tools. Git and GitHub have been essential in enabling dozens of scientists to develop E3SM simultaneously. Git’s branching model lets researchers isolate their development on feature branches and lets the group define criteria that a feature must pass before it is allowed on the main branch. GitHub lets the group control access to the model code for staff at multiple labs, and its web interface adds functionality on top of Git. GitHub features such as Pull Requests, which allow code to be reviewed before it is merged into the main code base, simplified joint development and kept the growth of the code base under control.
    • Perform comprehensive overnight testing. This is essential for allowing daily changes to the model. The group continually adjusts the balance between how many tests are run and what hardware they run on, so that developers get timely assurance that their code modifications are not breaking the model.

    Coupled Model Group

    • Practice what you preach. Best practices improve productivity by streamlining the workflow, ensuring provenance and reproducibility, and preserving simulation data.
    • Do a code review deep dive. Perform extended coupled simulations for both active features and tentative stealth features. Some hidden issues are better exposed in coupled mode.
    • Engage expertise from all component groups for each major stage of configuring the coupled physical system.
    • Use automated diagnostics suites to provide timely and extensive diagnostics coverage.
    • Dig into abnormal behaviors. This can be very rewarding (as demonstrated by Andrew Roberts’s debugging efforts).
    • Exercise development code by running multiple configurations regularly to avoid unpleasant surprises (historical, idealized CO2, +4K, …).

    Project Management

    • Establish open and transparent development practices. These included requiring everyone’s work to be documented on the project’s wiki (Confluence) and requiring open meetings (which everyone is welcome to attend), with the agenda, time, and call number announced before the meeting and meeting notes shared afterward.
    • Establish agile planning and reporting practices, with both long-term planning (10-year mission and vision, 3-year roadmaps) and short-term planning (yearly and quarterly roadmaps for all groups). Project tracking is done bi-weekly, requiring concise reports from everyone on their tasks and quarterly summary reports from group leads. The agile development cycle is closed with a retrospective of last quarter’s planning before planning for the next quarter.
    • Follow an open development model for all software.