A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

Aojie Jiang, Kang Zhu, Zhiheng Zhang, Zhengxu Su, Juntao Liu, Yuan Du, Li Du

Abstract

In-network computing techniques, exemplified by NVLink SHARP (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations, such as All-Reduce, to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger reduction operations, so the data reduced in the switch must be transferred back to the initiating GPU rather than broadcast directly, introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS must operate at FP16/BF16 precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for shared-memory networks of AI accelerators, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of initiating memory-semantic operations for in-network processing, together with a co-designed communication fabric that incurs negligible protocol overhead. By eliminating redundant data movement, SCIN delivers lower All-Reduce latency than NVLS. Moreover, by integrating a quantization module into the ISA, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling effective bandwidth with negligible accuracy loss. We also present a prototype of SCIN on a multi-FPGA system to demonstrate its feasibility and effectiveness. Experimental results show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, yielding up to 1.74x faster TTFT and 1.34x faster TPOT on LLaMA-2 models.
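
To make the bandwidth argument concrete: at FP16 every All-Reduce element occupies 2 bytes on the wire, whereas INQ sends 1 byte per element plus a small amount of scale metadata, so the payload is roughly halved in each direction. The sketch below illustrates how such in-network quantization could behave, assuming blockwise symmetric INT8 quantization with per-block FP16 scales; the block size, rounding mode, and the ISA's actual quantizer design are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of in-network quantization (INQ) for All-Reduce payloads.
# Assumes blockwise symmetric INT8 quantization with one FP16 scale per block;
# the paper's actual ISA quantizer is not specified here.
import numpy as np

BLOCK = 128  # hypothetical quantization block size

def quantize_int8(x: np.ndarray):
    """Quantize an FP16 tensor to INT8 with one FP16 scale per BLOCK elements."""
    x = x.astype(np.float32).reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero blocks
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate FP32 tensor from INT8 data and per-block scales."""
    return (q.astype(np.float32) * scale.astype(np.float32)).ravel()

# Emulate an 8-rank All-Reduce: each rank sends an INT8 payload to the switch,
# which dequantizes, reduces in higher precision, requantizes, and broadcasts.
rng = np.random.default_rng(0)
ranks = [rng.standard_normal(4096).astype(np.float16) for _ in range(8)]

acc = np.zeros(4096, dtype=np.float32)
for x in ranks:                                   # uplink: INT8 payloads
    q, s = quantize_int8(x)
    acc += dequantize_int8(q, s)

q_out, s_out = quantize_int8(acc.astype(np.float16))  # downlink: INT8 broadcast
result = dequantize_int8(q_out, s_out)

exact = np.sum(ranks, axis=0, dtype=np.float32)
print("max abs error:", np.abs(result - exact).max())
```

Under these assumptions, a 4096-element FP16 message costs 8192 bytes per direction, while the INT8 payload is 4096 bytes plus 64 bytes of scales (32 blocks of 128 elements), which is consistent with the abstract's "nearly doubling bandwidth" claim.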

Paper Structure

This paper contains 19 sections, 1 equation, 12 figures, and 2 tables.

Figures (12)

  • Figure 1: Two architectures for in-network computing: accelerator-centric architecture (left) and the proposed switch-centric architecture (right).
  • Figure 2: Tensor Parallelism and All-Reduce message size distribution in LLM inference
  • Figure 3: Communication and computation time breakdown of LLaMA-2 inference with TP=8 under different stages: Prefill (top) and Decode (bottom). The x-axis labels are shown as (a, b), denoting batch size and prefill length, respectively.
  • Figure 4: Switch Micro-architecture for SCIN
  • Figure 5: The synchronization mechanism in SCIN
  • ...and 7 more figures