Copyright Notice:
The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Publications of SPCL
Y. Feng, T. Chen, Y. Wei, S. Shen, S. Wang, W. Li, K. Ma, T. Hoefler:
RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems (arXiv:2507.18889, Jul. 2025)

Abstract
Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architectures are neither scalable nor cost-effective enough. Tree-based topologies such as the rail-optimized network are extremely expensive, while direct topologies such as the torus have insufficient bisection bandwidth and flexibility. In this paper, we propose RailX, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically organized in 2D, achieving better scalability than existing centralized circuit-switching networks. We propose a novel interconnection method based on Hamiltonian Decomposition theory to organize separate rail-based rings into an all-to-all topology, simultaneously optimizing ring-collective and all-to-all communication. More than 100K chips with hyper bandwidth can be interconnected with a flat switching layer, and the diameter is only 2∼4 inter-node hops. The network cost per injection/All-Reduce bandwidth of RailX is less than 10% of the Fat-Tree's, and the cost per bisection/All-to-All bandwidth is less than 50% of the Fat-Tree's. Specifically, only ∼1.3B is required to interconnect 200K chips with 1.8TB bandwidth. RailX can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped, and failures can be worked around.

Documents
Preprint on arXiv: arXiv:2507.18889
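The interconnection method described in the abstract leans on a classical result from Hamiltonian Decomposition theory: the complete graph K_{2m+1} splits into m edge-disjoint Hamiltonian cycles, so a stack of per-rail rings can jointly realize an all-to-all topology. The sketch below is a minimal, hypothetical illustration of that underlying result (the Walecki zigzag-and-rotate construction), not the RailX construction itself; the function names and the small verification harness are assumptions made for the example.

```python
# Hypothetical illustration (not the construction from the RailX paper):
# the classic Walecki decomposition of the complete graph K_{2m+1} into
# m edge-disjoint Hamiltonian cycles, the kind of result Hamiltonian
# Decomposition theory provides for turning rail-based rings into an
# all-to-all topology.

from itertools import combinations


def hamiltonian_decomposition(m: int) -> list[list[int]]:
    """Decompose K_{2m+1} (vertices 0..2m-1 plus a hub vertex 2m)
    into m edge-disjoint Hamiltonian cycles via the zigzag construction."""
    n, hub = 2 * m, 2 * m
    # Base zigzag over Z_{2m}: 0, 1, -1, 2, -2, ..., m-1, -(m-1), m
    zigzag = [0]
    for j in range(1, m):
        zigzag += [j, -j % n]
    zigzag.append(m)
    cycles = []
    for i in range(m):                      # m shifted copies (shift 0..m-1)
        ring = [hub] + [(v + i) % n for v in zigzag]
        cycles.append(ring)                 # closed implicitly: last -> hub
    return cycles


def cycle_edges(cycle: list[int]) -> set[frozenset[int]]:
    """Undirected edge set of a closed cycle given as a vertex sequence."""
    return {frozenset((cycle[k], cycle[(k + 1) % len(cycle)]))
            for k in range(len(cycle))}


if __name__ == "__main__":
    m = 4                                   # K_9: 9 vertices, 36 edges
    rings = hamiltonian_decomposition(m)
    used = set()
    for ring in rings:
        edges = cycle_edges(ring)
        assert not (edges & used), "rings must be edge-disjoint"
        used |= edges
    # Together the m rings must cover every edge of K_{2m+1} exactly once.
    all_edges = {frozenset(e) for e in combinations(range(2 * m + 1), 2)}
    assert used == all_edges
    print(f"{m} edge-disjoint Hamiltonian rings cover all of K_{{{2 * m + 1}}}")
```

One way to read the abstract's claim of simultaneously optimizing ring-collective and all-to-all communication is that each such ring can be pinned to one rail, so ring-collectives stay inside a single ring while the union of the rings gives every pair of nodes a direct path. As a rough sanity check on the quoted cost figure, ∼1.3B spread over 200K chips works out to about 6.5K of network cost per chip (assuming the figure is in USD).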