Heroic Bug Fixes: How zstash Improvements Helped the High-Resolution Team Archive at Scale

  • May 28, 2026
  • Brief

    Bugs are an inevitable part of any complex software project, and E3SM is no exception. A lot of time goes into finding and fixing bugs, and the resulting impacts can rival those of major parameterization changes, yet these efforts and their impacts frequently go unreported. Heroic Bug Fixes is a recurring column that celebrates the critical yet often overlooked work of debugging. We hope that shining a well-deserved spotlight on this work will inspire further debugging efforts and give the broader E3SM community timely information about changes that could aid their own development and investigations.

    For E3SM scientists, archiving simulation output should be routine. Once a run is complete, data should move into long-term storage reliably and efficiently, without demanding extra manual work from researchers. As the high-resolution team ramped up in late 2025, their larger data volumes made archive efficiency especially important. Workflow limitations that might be manageable at smaller scales became much more significant for these campaigns. Quickly resolving any such limitations was thus critical.

    In December 2025, the high-res team reported a critical task for the archiving package zstash, involving three concerns: (1) long-running Globus transfers timing out, (2) long runtimes for zstash check, a command for checking the integrity of data archives, and (3) long runtimes for zstash update, a command for updating data archives.

    The lead developer of zstash, Ryan Forsyth, got to work on these three related problems. The token-timeout issue grew out of NERSC’s earlier migration to Globus GCSv5. zstash had already been updated to address most of the related authentication problems, including improved scope handling for endpoint consent, direct use of the globus-sdk authentication flow, and local storage of refresh tokens to support automatic token renewal. The one open question was whether those changes fully eliminated timeouts in very long transfers.

    Answering that question required rethinking the testing approach. Launching a transfer that would run long enough required a large volume of data. Alternatively, a sleep period could be added to the code to mimic a long-running transfer, but doing so could introduce new problems. Ryan and Wuyin Lin of the high-resolution team, along with other zstash stakeholders, worked through ways to verify that the token-timeout issue was resolved. Ultimately, the only way to know for sure was to run a real-world long-running transfer. Wuyin did exactly that, with a transfer that ran for nearly 120 hours, or five full days. This confirmed that the earlier Globus authentication improvements had already resolved the token-timeout issue: the latest version of zstash no longer ran into the problem, and transfers could continue well beyond the old 48-hour limit without manual re-authentication. That finding closed one of the three critical-task items without further code changes, thanks to careful investigation by both Wuyin and Ryan.

    The second issue, slow performance in zstash check for updated runs, also turned out to have a practical resolution. By mid-December, it was recognized that the existing --tars option already provided what users needed by allowing them to specify where to resume, rather than starting again from the beginning. Once that was identified and communicated, the issue was effectively resolved.

    The remaining issue, zstash update performance, did require a code change. For high-resolution workflows, update operations were taking too long, creating a meaningful bottleneck in the archive process. It took some time to determine the root cause of the inefficiency, and Ryan began a performance profiling effort to track it down. Ryan and Chris Golaz concluded that calls to os.lstat were the likely main source of the slowdown, and replacing os.lstat with os.scandir did significantly improve performance. Even better, the debugging effort produced a performance profiling script that is now routinely run on zstash, which helps guard against future performance regressions.
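    To illustrate the general idea behind that change, here is a minimal Python sketch that gathers file sizes and modification times for a directory tree in two ways: once with os.listdir plus an explicit os.lstat call per path, and once with os.scandir. This is not the zstash implementation. The function names, the FileInfo tuple, and the demo tree are hypothetical, chosen only for illustration, and how much os.scandir helps in practice likely depends on the file system and on which metadata the caller needs.

```python
import os
import stat
import tempfile
import time
from typing import Dict, Tuple

# Hypothetical record for this sketch: (size in bytes, modification time).
FileInfo = Tuple[int, float]


def collect_with_lstat(root: str) -> Dict[str, FileInfo]:
    """List each directory, then call os.lstat explicitly on every path."""
    info: Dict[str, FileInfo] = {}
    stack = [root]
    while stack:
        current = stack.pop()
        for name in os.listdir(current):
            path = os.path.join(current, name)
            st = os.lstat(path)  # one explicit call per entry, files and directories alike
            if stat.S_ISDIR(st.st_mode):
                stack.append(path)
            else:
                info[os.path.relpath(path, root)] = (st.st_size, st.st_mtime)
    return info


def collect_with_scandir(root: str) -> Dict[str, FileInfo]:
    """Use os.scandir, whose DirEntry objects cache type and stat information."""
    info: Dict[str, FileInfo] = {}
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # The file type usually comes from the directory listing itself,
                # so this check often avoids a separate stat call.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    # Cached on the entry after the first call.
                    st = entry.stat(follow_symlinks=False)
                    info[os.path.relpath(entry.path, root)] = (st.st_size, st.st_mtime)
    return info


if __name__ == "__main__":
    # Build a small throwaway tree and time both approaches on it.
    with tempfile.TemporaryDirectory() as demo_root:
        for d in range(20):
            subdir = os.path.join(demo_root, f"dir{d:02d}")
            os.makedirs(subdir)
            for f in range(50):
                with open(os.path.join(subdir, f"file{f:02d}.dat"), "wb") as fh:
                    fh.write(b"x" * f)
        for func in (collect_with_lstat, collect_with_scandir):
            start = time.perf_counter()
            result = func(demo_root)
            elapsed = time.perf_counter() - start
            print(f"{func.__name__}: {len(result)} files in {elapsed:.4f} s")
```

    Both functions return the same mapping, so a candidate replacement can be checked for correctness before its speed is compared. A tree this small will not show a meaningful timing difference; a realistic test would need a tree closer in size to a high-resolution archive.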

    By January 2026, all three concerns raised just the month before had been resolved. Together, those efforts gave the high-resolution team a smoother, faster, and more dependable archive workflow for large simulation campaigns.

    Contact

    • Ryan Forsyth, Lawrence Livermore National Laboratory

    This article is part of the E3SM “Floating Points” Newsletter.
