Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang

TL;DR

The paper tackles the memory and compute bottlenecks of LLMs and VLMs by introducing Activation-Informed Pareto-Guided SVD (PGSVD), a zero-shot low-rank compression framework. It develops a theoretical link between layer-wise activation-based compression errors and whole-network loss, and proves that a single uniform error tolerance induces surrogate Pareto-optimal, heterogeneous layer ranks. Building on this, PGSVD combines Pareto-guided rank selection with an efficient alternating least-squares (ALS) solver to compress unimodal LLMs and multimodal VLMs, achieving higher accuracy at the same memory and throughput budgets along with faster inference. Empirically, PGSVD attains up to 30–40% improvements in accuracy under comparable compression and shows strong zero-shot performance on reasoning benchmarks and multimodal tasks, with throughput improvements across devices. The approach offers a single principled knob, the error tolerance, for loss–compression trade-offs and extends naturally to modality-specific pipelines in multimodal models.

Abstract

Large language models (LLMs) and vision-language models (VLMs) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change of network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and an alternating least-squares implementation. We apply PGSVD to both LLMs and VLMs, showing better accuracy at the same compression levels as well as inference speedup.
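
To make the "single uniform tolerance yields heterogeneous ranks" idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest possible error measure, the relative Frobenius error of a plain truncated SVD of the weights, whereas the paper works with activation-based errors. The helper names (`rank_for_tolerance`, `truncate`, `make_layer`), the synthetic spectra, and the tolerance value `eps = 0.1` are all hypothetical choices made for illustration.

```python
import numpy as np


def rank_for_tolerance(W, eps):
    """Smallest rank r whose truncated SVD has relative Frobenius error <= eps.

    Assumption: plain weight-space error, not the activation-based error
    used in the paper.
    """
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    energy = s ** 2
    # tail[r] = energy discarded when only the top-r singular values are kept;
    # tail[0] is the total energy and tail[len(s)] is exactly 0.
    tail = np.concatenate((np.cumsum(energy[::-1])[::-1], [0.0]))
    rel_err = np.sqrt(tail / tail[0])
    # argmax on a boolean array returns the index of the first True entry.
    return int(np.argmax(rel_err <= eps))


def truncate(W, r):
    """Best rank-r approximation of W (Eckart-Young) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]


def make_layer(rng, n, decay):
    """Hypothetical n x n 'layer' with singular values exp(-decay * i)."""
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.exp(-decay * np.arange(n))
    return (U * s) @ V.T  # U @ diag(s) @ V.T


rng = np.random.default_rng(0)
layers = {
    "fast_decay": make_layer(rng, 64, 0.5),   # highly compressible spectrum
    "slow_decay": make_layer(rng, 64, 0.05),  # nearly flat spectrum
}

eps = 0.1  # one tolerance shared by every layer
ranks = {name: rank_for_tolerance(W, eps) for name, W in layers.items()}

for name, W in layers.items():
    r = ranks[name]
    err = np.linalg.norm(W - truncate(W, r)) / np.linalg.norm(W)
    print(f"{name}: rank {r} of {min(W.shape)}, relative error {err:.4f}")
```

The same tolerance assigns a small rank to the layer whose spectrum decays quickly and a much larger rank to the layer whose spectrum is nearly flat, so the per-layer compression ratios differ even though only one number was tuned.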

Paper Structure

This paper contains 37 sections, 4 theorems, 35 equations, 4 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $\boldsymbol{x}_{l+1}=\sigma\bigl(\mathbf{W}_l \boldsymbol{x}_l\bigr)$, with batch $\mathbf{X}_l=\bigl[\boldsymbol{x}_l^{(1)}~\cdots~\boldsymbol{x}_l^{(B)}\bigr]$, where $\sigma$ acts elementwise and $\sup_{t\in\mathbb R}|\sigma'(t)|\le c<\infty$, and $\hat{\mathbf{W}}_l=\mathbf{W}_l+\Delta\cdots$ […] where $G := \bigl\|\nabla_{\mathbf{Y}}\mathcal{L}\bigr\|_{F}$, $\mathbf{Y}=\mathbf{X}_{L+1}$, […]

Figures (4)

  • Figure 1: Overview of PGSVD: (left) unimodal model using a uniform error tolerance that yields heterogeneous compression ratios; (right) multimodal model with separate uniform tolerances for each tower.
  • Figure 2: Compression times of different solvers for different models (top) and perplexity versus the number of ALS iterations for LLaMA-2-7B (bottom).
  • Figure 3: Inference throughput of LLaMA-2-7B (left) and Mistral-7B (right) for 20% and 40% compression using PGSVD and SVD-ALS compared to the base model.
  • Figure 4: SVD profiles for LLaMA-2 7B (left) and 13B (right).

Theorems & Definitions (14)

  • Theorem 1: Loss Sensitivity to Activation-Based Compression
  • proof
  • Definition 1: $\varepsilon$--Parameter Mapping via SVD
  • Proposition 1: Rank--$\varepsilon$ Allocation Equivalence
  • proof
  • Lemma 1: Uniform $\varepsilon$ under homogeneous sensitivity and bounded profiles
  • proof
  • Theorem 2: Uniform $\varepsilon$ yields the surrogate Pareto front of (B)
  • proof
  • proof
  • ...and 4 more