Morphlux: Transforming Torus Fabrics for Efficient Multi-tenant ML

Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Darius Bunandar, Rachee Singh

TL;DR

Torus-based ML fabrics are efficient for large-scale training but underperform for multi-tenant inference and fine-tuning workloads because of bandwidth under-utilization, compute fragmentation, and large fault blast radii. Morphlux proposes a server-scale programmable photonic fabric that redirects accelerator egress bandwidth and reconfigures intra-server topology to maintain torus-like contention-free communication while enabling arbitrary tenant slices. The authors develop MorphMgr with a fragmented-slice ILP allocator, in-place fault tolerance via SRG-aware spare XPUs, and a hardware control plane to realize optical circuits; they validate these ideas with an end-to-end hardware prototype and a large-scale TPU-cluster simulator. Key results include up to $66\%$ bandwidth gains, up to $70\%$ fragmentation reduction, a $1.72\times$ end-to-end training throughput improvement, and replacement of a failed accelerator in $1.2\,\text{s}$ on the hardware testbed. The work highlights the potential of server-scale photonics to complement torus fabrics and deliver robust, scalable performance for diverse ML workloads, with open-source artifacts enabling further exploration.

Abstract

We develop Morphlux, a server-scale programmable photonic fabric to interconnect accelerators within servers. We show that augmenting state-of-the-art torus-based ML data-centers with Morphlux can improve the bandwidth of tenant compute allocations by up to 66%, reduce compute fragmentation by up to 70%, and minimize the blast radius of chip failures. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these performance benefits, which translate to a 1.72X improvement in training throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can replace a failed accelerator chip with a healthy one in 1.2 seconds.

Paper Structure

This paper contains 27 sections, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: The top figure shows how ML accelerators (XPUs) are stacked on top of the optical interposer in a single server. Connections between accelerators are direct optical links. The bottom figure shows a vertical cross-section of the server where two XPUs connect on two tiles and communicate via the underlying optical link (shown in red). The connections between accelerators can be programmed to redirect egress bandwidth from one accelerator to another.
  • Figure 2: Figure \ref{fig:tpu-single} shows a horizontal plane of the TPU rack. This plane has 4 servers (in blue), each with 4 TPU chips (in yellow). Each TPU chip has its escape bandwidth divided across links in three directions: X, Y and Z. Figure \ref{fig:tpu-pod} shows how multiple planes are connected to each other to form a rack. Figures \ref{fig:tpu-single} and \ref{fig:tpu-pod} show the different rings used by the collective communication algorithms optimized for the torus topology. Figure \ref{fig:tpu-slices} shows how racks are connected to each other via OCSes to allocate multi-rack and sub-rack TPU slices to tenants.
  • Figure 3: Figure \ref{fig:tpu_slices} shows a rack with multiple sub-rack tenant slices. Figure \ref{fig:y_links_underutil} shows the number of links that are used by slices in every rack (or block) in a torus-based TPU datacenter. Figure \ref{fig:agg_bw} shows the aggregate throughput of using one ICI link vs. using two ICI links. The aggregate throughput is measured by sending data in parallel from TPU 0 to two destination devices, TPU 1 and TPU 2. Figure \ref{fig:fragmentation} shows the fragmentation index across racks after allocating the cluster completely and deallocating 20% of the slices.
  • Figure 4: Design of MorphMgr.
  • Figure 5: Figure \ref{fig:dp} shows the recursion tree of $Z(K)$ for 2 SRGs. In Figure \ref{fig:zk-failures}, we set the SRG to a single XPU and $N$ to 64; the figure shows that 4 XPUs are sufficient to respect a $95\%$ SLO when the SRG is a single XPU. Similarly, Figure \ref{fig:zk-failures-servers} shows that two additional servers (4 XPUs per server) are sufficient to respect a $95\%$ SLO in most cases. In both \ref{fig:zk-failures} and \ref{fig:zk-failures-servers}, the need for additional hardware increases with failure probability.
  • ...and 8 more figures