Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
TL;DR
The paper tackles the memory and compute bottlenecks of LLMs and VLMs by introducing Activation-Informed Pareto-Guided SVD (PGSVD), a zero-shot low-rank compression framework. It establishes a theoretical link between layer-wise activation-based compression errors and the whole-network loss, and proves that a single uniform error tolerance induces surrogate Pareto-optimal, heterogeneous layer ranks. Building on this, PGSVD combines Pareto-guided rank selection with an efficient alternating least-squares (ALS) solver to compress unimodal LLMs and multimodal VLMs, achieving higher accuracy at the same memory and throughput budgets along with faster inference. Empirically, PGSVD attains accuracy improvements of up to 30–40% under comparable compression and shows strong zero-shot performance on reasoning benchmarks and multimodal tasks, with robust throughput improvements across devices. The approach offers a principled, data-agnostic knob for loss–compression trade-offs and extends naturally to modality-specific pipelines in multimodal models.
Abstract
Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they impose significant memory and compute costs in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper-bound the change in network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization problem and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on these theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and an alternating least-squares implementation. We apply PGSVD to both LLMs and VLMs, showing better accuracy at the same compression levels as well as inference speedups.
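To make the idea of a single uniform tolerance producing heterogeneous ranks concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's PGSVD implementation: the helper names (`rank_for_tolerance`, `make_layer`), the synthetic layers, and the closed-form factorization via an SVD of `W @ X` are all assumptions made for this example, and the closed form stands in for the ALS solver mentioned above. For each layer it picks the smallest rank whose relative activation-based error, ||(W - W_r) X||_F / ||W X||_F, stays within the shared tolerance.

```python
import numpy as np

def rank_for_tolerance(W, X, eps):
    """Smallest rank r with ||(W - W_r) X||_F <= eps * ||W X||_F.

    Hypothetical helper: uses the SVD of the layer output W @ X, for which
    W_r = U_r U_r^T W attains the best rank-r output error.
    """
    U, s, _ = np.linalg.svd(W @ X, full_matrices=False)
    energy = np.cumsum(s ** 2)
    total = energy[-1]
    # residual[r] = squared output error when r singular directions are kept
    residual = np.maximum(total - np.concatenate(([0.0], energy)), 0.0)
    r = int(np.argmax(residual <= (eps ** 2) * total))  # first rank meeting eps
    A = U[:, :r]            # (d_out, r) left factor
    B = U[:, :r].T @ W      # (r, d_in) right factor
    return r, A, B

def make_layer(rng, d, decay):
    """Synthetic square layer whose singular values are decay**i (assumption)."""
    Q1, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Q2, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q1 @ np.diag(decay ** np.arange(d)) @ Q2.T

rng = np.random.default_rng(0)
d, n_samples, eps = 64, 256, 0.1
X = rng.standard_normal((d, n_samples))  # stand-in calibration activations
layers = {"fast_decay": make_layer(rng, d, 0.5),
          "slow_decay": make_layer(rng, d, 0.95)}

ranks, errors = {}, {}
for name, W in layers.items():
    r, A, B = rank_for_tolerance(W, X, eps)
    ranks[name] = r
    errors[name] = np.linalg.norm((W - A @ B) @ X) / np.linalg.norm(W @ X)
    print(f"{name}: rank={r}, relative activation error={errors[name]:.4f}")
```

Both layers receive the same tolerance `eps`, yet the layer with the slowly decaying spectrum needs a much larger rank than the one with the fast-decaying spectrum. That is the sense in which one tolerance acts as a single knob while the per-layer ranks remain heterogeneous.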
