New Zstash Capabilities

  • May 16, 2020
  • Releases

    With v0.4.1, Zstash can now create local archives on machines without HPSS, while continuing to save time during both archiving and extraction.

     

    What is Zstash?

    Zstash is a long-term archiving solution for E3SM.

    Figure 1: Performance comparison for ‘zstash create’. Left: manual operations with separate md5sum, tar and hsi put steps. Right: comparable combined operations with Zstash. Performance data for a 4 TB archive consisting of more than 13,000 files. Mean and range of three realizations on NERSC’s Data Transfer Nodes (dtn).

    E3SM simulations generate large amounts of data that must be archived for long-term storage. At Department of Energy (DOE) computing centers, this is done on tape using the High Performance Storage System (HPSS). HPSS performs best when the stored data consists of a relatively small number of large files, so archiving individual E3SM model output files directly to HPSS is impractical. In addition, data on HPSS can, on rare occasions, become corrupted; detecting this requires computing checksums during both archival and retrieval. To address these challenges, E3SM developers created Zstash, a command-line utility written in Python. Its design is intentionally minimalist: it aims to provide an effective long-term archiving solution without becoming an overly complicated, hard-to-maintain tool.

    Key features:

    • Files are archived into standard tar files with a user-specified maximum size optimized for HPSS storage, typically 128 to 256 GB.
    • Tar files are created locally first, then transferred to HPSS.
    • Checksums (md5) of input files are computed on the fly during archiving. For large files, this saves considerable time compared with separate checksumming and archiving steps (Figure 1). Checksums are computed on the fly again during extraction to verify file integrity.
    • Checksums and additional metadata (size, modification time, tar file and offset) are stored in an SQLite3 index database.
    • The SQLite3 database enables faster retrieval of individual files by providing the containing tar file and offset (location) within that tar file.
    • Parallel extraction is supported for additional performance (Figure 2).
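
    The ideas in the list above (hashing while archiving, recording the tar file and offset in SQLite, and seeking straight to a member on retrieval) can be illustrated with a minimal Python sketch. This is not Zstash's actual code: the names HashingReader, archive_files and verify_member, as well as the table layout, are hypothetical and chosen only to mirror the metadata listed above.

```python
import hashlib
import os
import sqlite3
import tarfile


class HashingReader:
    """Wrap a binary file so every chunk read also updates an MD5 digest."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.md5 = hashlib.md5()

    def read(self, size=-1):
        chunk = self._fileobj.read(size)
        self.md5.update(chunk)
        return chunk


def archive_files(root, names, tar_path, index_path):
    """Archive files (paths relative to root) and index them in SQLite.

    The 'files' table below is an illustrative schema, not Zstash's own.
    """
    con = sqlite3.connect(index_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(name TEXT, size INTEGER, mtime REAL, md5 TEXT, tar TEXT, offset INTEGER)"
    )
    with open(tar_path, "wb") as raw, tarfile.open(fileobj=raw, mode="w") as tar:
        for name in names:
            offset = raw.tell()  # byte position where this member's record starts
            full_path = os.path.join(root, name)
            info = tar.gettarinfo(full_path, arcname=name)
            with open(full_path, "rb") as src:
                reader = HashingReader(src)
                tar.addfile(info, reader)  # one pass: copy into tar and hash
            con.execute(
                "INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)",
                (info.name, info.size, info.mtime, reader.md5.hexdigest(),
                 os.path.basename(tar_path), offset),
            )
    con.commit()
    con.close()


def verify_member(tar_path, offset, expected_md5, chunk_size=1 << 20):
    """Seek to a stored offset, read that one member, and check its MD5."""
    md5 = hashlib.md5()
    with open(tar_path, "rb") as raw:
        raw.seek(offset)  # skip everything before the requested member
        with tarfile.open(fileobj=raw, mode="r:") as tar:
            member = tar.next()
            src = tar.extractfile(member)
            for chunk in iter(lambda: src.read(chunk_size), b""):
                md5.update(chunk)
    return md5.hexdigest() == expected_md5
```

    Because the index stores the containing tar file and the offset of each member, a single file can be checked or retrieved without scanning the whole archive, which is the benefit the list attributes to the SQLite3 database.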

    E3SM now requires all simulations to be archived using Zstash to ensure standardization and completeness in simulation archiving.

    What’s new?

    Figure 2: Performance comparison for ‘zstash extract’. Left: manual operations with separate hsi get, tar and md5sum steps. Middle: comparable combined operations with Zstash. Right: Zstash parallel with 3 workers. Performance data for a 4 TB archive consisting of more than 13,000 files. Mean and range of three realizations on NERSC’s Data Transfer Nodes (dtn).

    Zstash v0.4.1 has been released. Zstash can now be used on machines without HPSS, such as Anvil and Compy. By specifying --hpss=none, users store files in a local archive instead of an HPSS archive. This option is available even on machines that have HPSS. Running Zstash without HPSS reduces the runtime and lets users transfer the archive’s tar files to long-term storage elsewhere with a tool such as Globus.

    The E3SM project recently decided to make LLNL’s petabyte storage system its permanent long-term archive and has already archived all production simulation data there, including data from NERSC’s HPSS archive. Data archived at LLNL must be archived with Zstash. Future production simulations will be permanently archived at LLNL and can also be stored at NERSC via HPSS if that is convenient for users.
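
    As a rough illustration of the local-archive option, the following Python sketch builds a ‘zstash create’ command with --hpss=none. The subcommand and the flag come from the text above; passing the directory to archive as a positional argument is an assumption about the command line, and the helper name create_local_archive is hypothetical.

```python
import subprocess


def create_local_archive(source_dir, dry_run=True):
    """Build (and optionally run) a zstash command for a local archive.

    --hpss=none selects a local archive instead of HPSS. Passing the
    directory as a positional argument is an assumption about the CLI.
    """
    cmd = ["zstash", "create", "--hpss=none", source_dir]
    if not dry_run:
        # Requires zstash to be installed and on PATH.
        subprocess.run(cmd, check=True)
    return cmd
```

    With dry_run left at its default, the function only returns the argument list, so the command can be inspected before anything is archived.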

    Developers have also done substantial work under the hood, refactoring the code and the test suite to make sure Zstash functions properly. In addition, the Zstash build now uses noarch, so developers no longer need to make separate releases for Linux and macOS: users download a single release that works on either platform. Zstash is also included in the E3SM Unified environment.
