Ouroboros: Wafer-Scale SRAM CIM with Token-Grained Pipelining for Large Language Model Inference
Yiqi Liu, Yudong Pan, Mengdi Wang, Shixin Zhao, Haonan Zhu, Yinhe Han, Lei Zhang, Ying Wang
TL;DR
This paper proposes Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ and eliminates off-chip migration, and introduces three innovations to make the most of its limited first-level capacity.
Abstract
Conventional LLM inference architectures suffer from high energy consumption and latency due to frequent data movement across the memory hierarchy. We propose Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ, eliminating off-chip migration. To make the most of its limited first-level capacity, we introduce three innovations: (1) Token-Grained Pipelining, which replaces sequence-level pipelining to mitigate the effect of sequence-length variation, boosting utilization and reducing activation storage; (2) Distributed Dynamic KV Cache Management, which decouples memory from compute so that fragmented SRAM can be used for efficient KV storage; and (3) Communication-Aware Mapping, which optimizes core allocation for locality and fault tolerance across the wafer. Experimental results show that Ouroboros achieves average gains of $4.1\times$ in throughput and $4.2\times$ in energy efficiency, peaking at $9.1\times$ and $17\times$, respectively, for the 13B model. (Note: because arXiv limits the Abstract field to 1,920 characters, the abstract shown here is shortened. For the full abstract, please download the article.)
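The utilization argument behind token-grained pipelining can be illustrated with a toy model. The sketch below is not the scheduler described in the paper. It assumes a linear pipeline in which every stage spends one time unit per token, it ignores attention dependencies between tokens, and the names `makespan`, `utilization`, `seq_lens`, and `stages` are hypothetical. Under these assumptions it compares handing whole sequences from stage to stage against handing individual tokens.

```python
def makespan(job_times, num_stages):
    """Finish time of a linear pipeline where job i takes job_times[i] on every stage."""
    finish = [0] * num_stages  # finish[s]: time at which stage s completed its latest job
    for t in job_times:
        prev_stage_done = 0  # time at which this job left the previous stage
        for s in range(num_stages):
            # A job starts on stage s once the stage is free and the job has arrived.
            start = max(finish[s], prev_stage_done)
            finish[s] = start + t
            prev_stage_done = finish[s]
    return finish[-1]


def utilization(job_times, num_stages):
    """Fraction of stage-time slots that do useful work over the whole run."""
    busy = sum(job_times) * num_stages
    return busy / (num_stages * makespan(job_times, num_stages))


# Assumed workload: four sequences with uneven lengths (in tokens), four pipeline stages.
seq_lens = [8, 2, 8, 2]
stages = 4

# Sequence-level: each job is a whole sequence, so its stage time equals its length.
seq_level = utilization(seq_lens, stages)

# Token-grained: each job is a single token with unit stage time.
tok_level = utilization([1] * sum(seq_lens), stages)

print(f"sequence-level utilization: {seq_level:.2f}")
print(f"token-grained utilization:  {tok_level:.2f}")
```

In this toy setting, short sequences stall behind long ones at every stage, so the sequence-level run takes 44 time units, while the token-grained run takes 23 for the same 20 tokens. The model also hints at the storage claim: a stage that forwards whole sequences must buffer activations for an entire sequence, whereas a stage that forwards tokens buffers one token at a time.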
