FastReChain: A Novel Bidirectional Model-Based Algorithm for Topology Engineering of OCS-Based Clusters

Zihan Zhu, Xinchi Han, Dongchao Wu, Zhanbang Zhang, Jian Yang, Shizhen Zhao, Xinbing Wang

TL;DR

This paper proposes a novel bidirectional modeling approach for Topology Engineering of OCS-based clusters, together with a corresponding algorithm, FastReChain, and shows the algorithm's advantages through simulation experiments based on real-trace data.

Abstract

Optical Circuit Switching (OCS) technology is increasingly being adopted in data centers because of its low power consumption and low technology refresh costs. Unlike electrical packet switches, an OCS provides programmable bandwidth between directly connected devices by configuring the mapping between its internal ports. Computing these internal port mappings, i.e., Topology Engineering (ToE), is therefore a key design problem for OCS-based clusters. Current deployments usually design ToE algorithms by solving Integer Linear Programming (ILP) models, aiming to minimize modifications to links occupied by running tasks. However, ILP-based ToE algorithms may incur excessive runtime overhead in large-scale clusters. Some existing ToE algorithms convert the ILP model into a Minimum-Cost Flow model through greedy construction, yet such greedy strategies may increase the number of links affected during OCS reconfiguration. To address these problems, we propose a novel bidirectional modeling approach and a corresponding algorithm, FastReChain. We verify the superiority of this algorithm through simulation experiments based on real-trace data.
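To make the ToE objective concrete, the following minimal sketch treats a single OCS configuration as a one-to-one mapping from input ports to output ports and counts the in-use circuits that a reconfiguration would tear down. It is an illustration only, not the paper's bidirectional model or the FastReChain algorithm; the names `Mapping`, `is_valid_mapping`, and `affected_links` are hypothetical, and the single-switch port model is a simplifying assumption.

```python
from typing import Dict, Set, Tuple

# Assumption: one OCS is modeled as a dict from input port to output port.
Mapping = Dict[int, int]
Circuit = Tuple[int, int]  # (input port, output port)


def is_valid_mapping(mapping: Mapping) -> bool:
    """A circuit configuration is one-to-one: no output port is used twice."""
    outputs = list(mapping.values())
    return len(outputs) == len(set(outputs))


def affected_links(old: Mapping, new: Mapping, in_use: Set[Circuit]) -> Set[Circuit]:
    """Return the circuits that carry running tasks under the old
    configuration and do not exist in the new configuration."""
    old_circuits = set(old.items())
    new_circuits = set(new.items())
    return {c for c in in_use if c in old_circuits and c not in new_circuits}


# Hypothetical 4-port example: ports 1 and 2 swap their outputs.
old_cfg = {0: 0, 1: 1, 2: 2, 3: 3}
new_cfg = {0: 0, 1: 2, 2: 1, 3: 3}
busy = {(0, 0), (1, 1)}  # circuits currently occupied by running tasks

disrupted = affected_links(old_cfg, new_cfg, busy)
```

In this example only the circuit `(1, 1)` is both in use and removed, so a ToE algorithm that cares about running tasks would prefer a new configuration that keeps that circuit in place when the demand allows it.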

Paper Structure

This paper contains 29 sections, 2 theorems, 19 equations, 11 figures, 7 tables, 2 algorithms.

Key Result

Lemma 1

In the proportional traditional model, for the scheduling of a single connection between $U_{j_0}$ and $V_{k_0}$, assume that the link between $T_{i_0}$ and $U_{j_0}$ and the link between $T_{i_1}$ and $V_{k_0}$ are not fully used for some $T_{i_0}$ and $T_{i_1}$, then a valid replacement chain can

Figures (11)

  • Figure 1: A simple example of the structure of the bidirectional model.
  • Figure 2: A simple example of the traditional model.
  • Figure 3: The CDFs of the slowdown ratios for both JRT and JCT under ILP-based ToE algorithms demonstrate that minimizing the solving overhead is critical for OCS-based ML clusters.
  • Figure 4: OCS reconfiguration can cause packet loss, resulting in a significant reduction in ML training throughput for a specific time window, making it essential to minimize the rewiring ratio.
  • Figure 5: Mapping loss quantified in the bidirectional model. Instances labeled with the "mapped" suffix are versions that use a bidirectional model to traditional model conversion.
  • ...and 6 more figures

Theorems & Definitions (10)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Lemma 1
  • proof
  • Theorem 1
  • proof