Morphlux: Transforming Torus Fabrics for Efficient Multi-tenant ML
Abhishek Vijaya Kumar, Eric Ding, Arjun Devraj, Darius Bunandar, Rachee Singh
TL;DR
Torus-based ML fabrics are efficient for large-scale training but underperform for multi-tenant inference and fine-tuning workloads because of bandwidth under-utilization, compute fragmentation, and large fault blast radii. Morphlux is a server-scale programmable photonic fabric that redirects accelerator egress bandwidth and reconfigures intra-server topology, preserving torus-like contention-free communication while enabling arbitrary tenant slices. The authors develop MorphMgr with a fragmented-slice ILP allocator, in-place fault tolerance via SRG-aware spare XPUs, and a hardware control plane that realizes optical circuits; they validate these ideas with an end-to-end hardware prototype and a large-scale TPU-cluster simulator. Key results include up to $66\%$ bandwidth gains, up to $70\%$ fragmentation reduction, a $1.72\times$ improvement in end-to-end training throughput, and replacement of a failed accelerator in $1.2\,\text{s}$ on the hardware testbed. These results suggest that server-scale photonics can complement torus fabrics and deliver robust, scalable performance for diverse ML workloads; open-source artifacts enable further exploration.
Abstract
We develop Morphlux, a server-scale programmable photonic fabric to interconnect accelerators within servers. We show that augmenting state-of-the-art torus-based ML datacenters with Morphlux can improve the bandwidth of tenant compute allocations by up to 66%, reduce compute fragmentation by up to 70%, and minimize the blast radius of chip failures. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these performance benefits, which translate to a 1.72× improvement in the training throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can replace a failed accelerator chip with a healthy one in 1.2 seconds.
