Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs
João Paulo Cardoso de Lima, Marc Dietrich, Jeronimo Castrillon, Asif Ali Khan
TL;DR
The paper tackles the memory-bound decode stage of large language models by combining Monarch-structured block-diagonal sparsity with compute-in-memory (CIM) accelerators. It introduces an automated framework that transforms dense weights into Monarch matrices (D2S), maps these sparse structures onto CIM crossbars with latency- and capacity-optimized strategies, and uses a scheduling module to orchestrate execution while minimizing ADC-related energy and latency. Key contributions include the dense-to-sparse transformation $M = P L P R P$ (block-diagonal factors $L$ and $R$ interleaved with permutations $P$), two mapping schemes that maximize array utilization, rotation-aware handling of block-diagonal rotations, and permutation folding to reduce control overhead. Experiments across multiple transformer models show up to 87% fewer CIM arrays, about 78.8% average array utilization, and average latency and energy improvements of up to roughly 1.6–1.7×. Compared to dense baselines, memory footprint and FLOPs both drop by more than 4×, which makes the approach practical for in-memory inference of structured sparse LLMs.
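To make the factorization $M = P L P R P$ concrete, the following is a minimal pure-Python sketch of a Monarch matrix-vector product. It rests on assumptions that are not taken from the paper: the square case $n = b^2$ with $b$ blocks of size $b \times b$ in each factor, and $P$ taken as the stride permutation (view the vector as a $b \times b$ row-major grid and transpose it). All function names are hypothetical, and the paper's exact conventions may differ.

```python
def stride_permutation(x, b):
    """Apply P: view x as a b-by-b row-major grid and transpose it."""
    return [x[j * b + i] for i in range(b) for j in range(b)]


def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix, given as a list of square blocks, by x."""
    b = len(blocks[0])
    y = []
    for k, blk in enumerate(blocks):
        seg = x[k * b:(k + 1) * b]  # slice of x seen by block k
        y.extend(sum(w * v for w, v in zip(row, seg)) for row in blk)
    return y


def monarch_matvec(L, R, x):
    """Compute (P L P R P) x, applying the factors right to left."""
    b = len(L)
    y = stride_permutation(x, b)
    y = block_diag_matvec(R, y)
    y = stride_permutation(y, b)
    y = block_diag_matvec(L, y)
    return stride_permutation(y, b)


def matmul(A, B):
    """Dense matrix product on lists of lists (reference only)."""
    return [[sum(a * c for a, c in zip(row, col)) for col in zip(*B)] for row in A]


def block_diag_dense(blocks):
    """Expand a list of square blocks into a dense block-diagonal matrix."""
    b = len(blocks[0])
    n = b * len(blocks)
    M = [[0] * n for _ in range(n)]
    for k, blk in enumerate(blocks):
        for i in range(b):
            for j in range(b):
                M[k * b + i][k * b + j] = blk[i][j]
    return M


def permutation_matrix(b):
    """Dense matrix of the stride permutation P."""
    n = b * b
    src = stride_permutation(list(range(n)), b)  # src[k] = input index read by output k
    return [[1 if j == src[k] else 0 for j in range(n)] for k in range(n)]


def monarch_dense(L, R):
    """Dense reference M = P L P R P, used only to check the structured product."""
    P = permutation_matrix(len(L))
    M = P
    for factor in (block_diag_dense(L), P, block_diag_dense(R), P):
        M = matmul(M, factor)
    return M


if __name__ == "__main__":
    L = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    R = [[[1, 0], [2, 1]], [[0, 1], [1, 1]]]
    x = [1, 2, 3, 4]
    structured = monarch_matvec(L, R, x)
    dense = [sum(m * v for m, v in zip(row, x)) for row in monarch_dense(L, R)]
    assert structured == dense
    print(structured)
```

Under these assumptions the structured product stores $2b^3$ weights instead of $n^2 = b^4$, so the saving grows with the block count $b$. The dense reference exists only to confirm that the structured path computes the same result.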
Abstract
Structured sparsity enables deploying large language models (LLMs) on resource-constrained systems. Approaches like dense-to-sparse fine-tuning are particularly compelling: they reduce model size by over 6.7× while maintaining acceptable accuracy. Despite this reduction, LLM inference, especially the inherently memory-bound decode stage, remains extremely expensive on conventional von Neumann architectures. Compute-in-memory (CIM) architectures mitigate this by performing computations directly in memory. When paired with sparse LLMs, they allow the entire model to be stored and computed in memory, eliminating data movement over the off-chip bus and improving efficiency. Nonetheless, naively mapping sparse matrices onto CIM arrays leads to poor array utilization and diminished computational efficiency. In this paper, we present an automated framework with novel mapping and scheduling strategies to accelerate sparse LLM inference on CIM accelerators. By exploiting block-diagonal sparsity, our approach improves CIM array utilization by over 50% and achieves a more than 4× reduction in both memory footprint and the number of required floating-point operations.
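The utilization problem mentioned above can be illustrated with a toy calculation. The sketch below is not the paper's mapping scheme. It assumes utilization means the fraction of allocated crossbar cells that hold nonzero weights, that the matrix is n×n block-diagonal with b×b blocks (n divisible by b), and that crossbars are c×c. The function names and the example sizes are hypothetical.

```python
from math import ceil


def nonzero_cells(n, b):
    """Number of weights stored in an n x n block-diagonal matrix with b x b blocks."""
    return (n // b) * b * b


def naive_utilization(n, b, c):
    """Tile the whole n x n matrix, zeros included, onto c x c crossbars."""
    arrays = ceil(n / c) ** 2
    return nonzero_cells(n, b) / (arrays * c * c)


def occupied_tiles(n, b, c):
    """Count the c x c tiles that intersect at least one diagonal block."""
    tiles = set()
    for k in range(n // b):
        lo, hi = k * b, (k + 1) * b - 1  # row/column extent of block k
        for r in range(lo // c, hi // c + 1):
            for s in range(lo // c, hi // c + 1):
                tiles.add((r, s))
    return len(tiles)


def skip_zero_tiles_utilization(n, b, c):
    """Allocate crossbars only for tiles that contain nonzero weights."""
    return nonzero_cells(n, b) / (occupied_tiles(n, b, c) * c * c)


if __name__ == "__main__":
    n, b, c = 1024, 32, 256  # illustrative sizes, not from the paper
    print(naive_utilization(n, b, c))
    print(skip_zero_tiles_utilization(n, b, c))
```

For these illustrative sizes, tiling the full matrix fills about 3% of the allocated cells, and even skipping all-zero tiles reaches only 12.5%, because each diagonal tile still contains mostly zeros. This gap is the kind of inefficiency that dedicated mapping strategies aim to close.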
