ATLAHS Network Simulator Toolchain



Network simulators play a crucial role in evaluating the performance of large-scale systems. However, existing simulators either rely heavily on synthetic microbenchmarks or focus narrowly on specific domains, limiting their ability to provide comprehensive performance insights. In this work, we introduce ATLAHS (Application-centric Network Simulator Toolchain for AI, HPC, and Distributed Storage), a flexible, extensible, and open-source toolchain designed to trace real-world applications and accurately simulate their workloads. ATLAHS uses the GOAL format [1] to model the communication and computation patterns of AI, HPC, and distributed storage applications.
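
To make the format concrete, the sketch below is a minimal, hand-written schedule in the textual GOAL syntax introduced with LogGOPSim [2]; schedules generated by the ATLAHS tracers have the same structure but are much larger. Each rank's workload is expressed as a dependency graph of send, recv, and calc operations:

    num_ranks 2

    rank 0 {
      l1: send 8b to 1
      l2: recv 8b from 1
      l2 requires l1
    }

    rank 1 {
      l1: recv 8b from 0
      l2: calc 1000
      l3: send 8b to 0
      l2 requires l1
      l3 requires l2
    }

Here rank 1 receives 8 bytes from rank 0, performs 1000 units of local computation (calc), and then sends a reply; each "requires" clause is a happens-before dependency between two operations on the same rank. The simulator replays this graph under its network model to produce end-to-end timings.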

Source Code

The source code of the entire ATLAHS toolchain is available on GitHub: https://github.com/spcl/atlahs.

Trace Collection

Along with the toolchain, we also release a set of application traces collected from large-scale systems. The traces can be accessed at this link and are provided in both the raw and GOAL formats. The raw traces for HPC applications were collected with liballprof [2], a profiling library that records the MPI calls issued by an application; the raw traces for AI applications were collected with NVIDIA Nsight Systems and are stored as nsys-report files. A sketch of the kind of MPI traffic that liballprof captures is shown below, followed by a detailed description of the systems and applications used for trace collection.
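
For context, the minimal MPI ping-pong below is a hand-written illustration (not part of the released traces or the toolchain) of the kind of communication liballprof intercepts via MPI's profiling (PMPI) interface; the send/recv pairs and the compute time between calls are what the ATLAHS converters turn into send, recv, and calc operations in a GOAL schedule:

    /* Minimal MPI ping-pong; compile with an MPI compiler wrapper such as mpicc. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        double buf = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size >= 2) {
            if (rank == 0) {
                /* Traced as a send followed by a matching receive. */
                MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                buf += 1.0;  /* local work between MPI calls shows up as a calc interval */
                MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }

Running such a program with liballprof enabled (e.g., by linking or preloading the library when launching the job) produces raw per-rank traces, which the toolchain then converts into the GOAL format described above.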

1. AI Applications

The traces were collected from a system with the following configuration:
  • System: CSCS Alps
  • Number of nodes: 2,688
  • Node configuration: 4 NVIDIA Grace Hopper Superchips
  • Network topology: Dragonfly
  • The applications and their configurations are as follows:
    Application           Configuration
    DLRM                  4 GPUs, 4 Nodes
    Llama 7B              16 GPUs, 4 Nodes
                          64 GPUs, 16 Nodes
                          128 GPUs, 32 Nodes
    Llama 70B             256 GPUs, 64 Nodes
    MoE (Mistral) 8x7B    64 GPUs, 16 Nodes
    MoE 8x13B             128 GPUs, 32 Nodes
    MoE 8x70B             256 GPUs, 64 Nodes

2. HPC Applications

The traces were collected from a system with the following configuration:
  • System: CSCS Fat Tree Test-bed Cluster
  • Number of nodes: 188
  • Node configuration: 20-core Intel Xeon E5-2660 v2 CPU, 32 GB DDR3 RAM, ConnectX-3 56 Gbit/s NIC, CentOS 7.3
  • Network topology: Fat Tree (18 Mellanox SX6036 switches)
  • The applications and their configurations are as follows:
    Application    Configuration
    CloverLeaf     128 Procs, 8 Nodes
    HPCG           128 Procs, 8 Nodes
                   512 Procs, 32 Nodes
                   1024 Procs, 64 Nodes
    LULESH         128 Procs, 8 Nodes
                   432 Procs, 27 Nodes
                   1024 Procs, 64 Nodes
    LAMMPS         128 Procs, 8 Nodes
                   512 Procs, 32 Nodes
                   1024 Procs, 64 Nodes
    ICON           128 Procs, 8 Nodes
                   512 Procs, 32 Nodes
                   1024 Procs, 64 Nodes
    OpenMX         128 Procs, 8 Nodes
                   512 Procs, 32 Nodes

References

[1] T. Hoefler, C. Siebert, A. Lumsdaine: Group Operation Assembly Language - A Flexible Way to Express Collective Communication. In ICPP-2009, The 38th International Conference on Parallel Processing, Vienna, Austria, IEEE, ISBN: 978-0-7695-3802-0, Sep. 2009.

[2] T. Hoefler, T. Schneider, A. Lumsdaine: LogGOPSim - Simulating Large-Scale Applications in the LogGOPS Model. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (LSAP'10), Chicago, Illinois, pages 597-604, ACM, ISBN: 978-1-60558-942-8, Jun. 2010.