Switch-Less Dragonfly on Wafers: A Scalable Interconnection Architecture based on Wafer-Scale Integration

Yinxiao Feng; Kaisheng Ma

TL;DR

This work proposes Switch-Less Dragonfly on Wafers, a scalable interconnection architecture that eliminates costly high-radix switches by using wafer-scale, distributed networks-on-chip-on-wafer for intra- and inter-C-group connectivity. It presents a five-level topology (chiplet, C-group, wafer, W-group, system) and introduces minimal and non-minimal routing schemes that require only one additional virtual channel to avoid deadlock. Layout analysis and cycle-accurate simulation show higher local injection throughput, comparable global throughput, and substantial cost and cabling reductions versus the traditional Dragonfly, with the performance gains contingent on increasing intra-C-group bandwidth. The results suggest that wafer-scale switch-less designs can scale to large HPC systems and can be adapted to other direct topologies, potentially enabling power-efficient, high-bandwidth future supercomputers.

Abstract

Existing high-performance computing (HPC) interconnection architectures are based on high-radix switches, which limit injection/local performance and introduce latency, energy, and cost overhead. New wafer-scale packaging and high-speed wireline technologies provide high-density, low-latency, and high-bandwidth connectivity, and thus promise to support direct-connected high-radix interconnection architectures. In this paper, we propose a wafer-based interconnection architecture called Switch-Less-Dragonfly-on-Wafers. By utilizing distributed high-bandwidth networks-on-chip-on-wafer, the costly high-radix switches of the Dragonfly topology are eliminated while the injection/local throughput is increased and the global throughput is maintained. Based on the proposed architecture, we also introduce baseline and improved deadlock-free minimal/non-minimal routing algorithms that need only one additional virtual channel. Extensive evaluations show that Switch-Less-Dragonfly-on-Wafers outperforms the traditional switch-based Dragonfly in both cost and performance. Similar approaches can be applied to other switch-based direct topologies, and thus promise to power future large-scale supercomputers.

Paper Structure (48 sections, 7 equations, 15 figures, 4 tables, 1 algorithm)

Introduction
Background & Motivation
Wafer-Scale Integration
Technology Introduction
Wafer-based interconnection
HPC Network Fabric
State-of-the-Art Interconnection Networks
Dragonfly
Diameter 2 Topologies
HammingMesh
Architecture
Topology Description
Chiplet
C-Group
Wafer & W-Group
...and 33 more sections

Figures (15)

Figure 1: Profile of the InFO-SoW integration technology. Connectors and power modules are solder-joined to the InFO wafer (Chun et al., 2020).
Figure 2: The Dragonfly-based Slingshot topology. Switches are fully connected within groups, and groups are also all-to-all connected.
Figure 3: Hierarchical architecture of the wafer-based switch-less Dragonfly. (a) A chiplet has an on-chip network and several short-reach, low-latency interfaces used for interconnection. (b) Several chiplets are connected by a planar topology (2D-mesh by default), forming a C-group. The remaining short-reach interfaces at the edges of the C-group are converted to long-reach interfaces for upper-level high-radix interconnection. (c)(d) Each wafer consists of several C-groups, and several wafers form a W-group. All C-groups in a W-group are fully connected. (e) All the W-groups in the system are also fully connected, just as in the Dragonfly topology.
Figure 4: Bottleneck of the switch-less Dragonfly in collective communication. (a) Ring AllReduce algorithm; (b) 2D algorithm for AllReduce within the 2D-mesh-based C-group; (c) Local/global link underutilization due to injection bandwidth limit.
Figure 5: Wafer-level long-distance connectivity. All the edge IOs of each C-group are fanned out, and the long-distance wafer-level logical links are connected off-wafer physically.
...and 10 more figures
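
The hierarchy described in the Figure 3 caption can be sketched as a small counting model. This is only an illustration under stated assumptions, not the paper's configuration: the class name, field names, and the example sizes are hypothetical, and the sketch assumes exactly one logical link per pair of C-groups within a W-group and one per pair of W-groups, which is the minimum that "fully connected" implies.

```python
from dataclasses import dataclass


def full_mesh_links(n: int) -> int:
    """Number of links needed to connect n nodes all-to-all (one link per pair)."""
    return n * (n - 1) // 2


@dataclass(frozen=True)
class SwitchlessDragonflySketch:
    """Hypothetical size parameters for the five-level hierarchy."""
    mesh_rows: int          # chiplet rows in a C-group's 2D mesh (assumed)
    mesh_cols: int          # chiplet columns in a C-group's 2D mesh (assumed)
    cgroups_per_wafer: int  # C-groups on one wafer (assumed)
    wafers_per_wgroup: int  # wafers forming one W-group (assumed)
    wgroups: int            # W-groups in the whole system (assumed)

    @property
    def chiplets_per_cgroup(self) -> int:
        return self.mesh_rows * self.mesh_cols

    @property
    def cgroups_per_wgroup(self) -> int:
        return self.cgroups_per_wafer * self.wafers_per_wgroup

    @property
    def mesh_links_per_cgroup(self) -> int:
        # Horizontal plus vertical neighbor links of a rows x cols 2D mesh.
        return (self.mesh_rows * (self.mesh_cols - 1)
                + self.mesh_cols * (self.mesh_rows - 1))

    @property
    def local_links_per_wgroup(self) -> int:
        # All C-groups in a W-group are fully connected.
        return full_mesh_links(self.cgroups_per_wgroup)

    @property
    def global_links(self) -> int:
        # All W-groups in the system are fully connected.
        return full_mesh_links(self.wgroups)

    @property
    def total_chiplets(self) -> int:
        return self.chiplets_per_cgroup * self.cgroups_per_wgroup * self.wgroups


# Example with small, made-up sizes.
example = SwitchlessDragonflySketch(
    mesh_rows=2, mesh_cols=2, cgroups_per_wafer=4, wafers_per_wgroup=2, wgroups=3
)
print(example.total_chiplets, example.local_links_per_wgroup, example.global_links)
```

The model only counts nodes and logical links; it says nothing about bandwidth, routing, or how the long-reach links are physically cabled off-wafer.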

Theorems & Definitions (1)

Definition 1
