LatentLLM: Attention-Aware Joint Tensor Compression

Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu Wang, Matthew Brand

TL;DR

The paper tackles the substantial resource requirements of large language and multi-modal models by proposing LatentLLM, a training-free framework that converts pretrained models into a latent, reduced-dimension structure. It does so in two stages: a local, activation-aware SVD with pre-conditioning, followed by a global tensor decomposition that compresses multiple linear modules jointly, including the Q/K/V/O and MLP components. Key contributions include optimal pre-conditioning with $P = C^{1/2}$, the junction-matrix trick to minimize parameters, and a high-order SVD approach (HOSVD/Tucker) to handle the joint QK, VO, and UD decompositions. Empirically, LatentLLM and its multi-modal variant LatentLLaVa improve on local baselines across perplexity benchmarks and multi-modal reasoning tasks while reducing FLOPs and parameter counts, enabling more efficient yet accurate LLMs/LMMs.
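
The role of the pre-conditioner $P = C^{1/2}$ can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that $C$ is the empirical input correlation $XX^\top/n$ of calibration activations $X$, and that the objective is the output error $\|(W - \hat{W})X\|_F^2$. The function names (`sqrtm_psd`, `activation_aware_lowrank`, `plain_lowrank`, `activation_loss`) and the `eps` floor are hypothetical choices made for this illustration.

```python
import numpy as np

def sqrtm_psd(C, eps=1e-8):
    """Return (C^{1/2}, C^{-1/2}) for a symmetric PSD matrix C.

    eps floors tiny eigenvalues so the inverse exists; this is an
    assumption of the sketch for rank-deficient C.
    """
    w, V = np.linalg.eigh(C)
    w = np.clip(w, eps, None)
    root = (V * np.sqrt(w)) @ V.T
    inv_root = (V / np.sqrt(w)) @ V.T
    return root, inv_root

def activation_aware_lowrank(W, X, rank):
    """Rank-`rank` factors B, A with W ~= B @ A, weighted by activations X.

    W: (d_out, d_in) weight, X: (d_in, n) calibration activations.
    """
    C = X @ X.T / X.shape[1]           # assumed: empirical correlation of inputs
    P, P_inv = sqrtm_psd(C)            # pre-conditioner P = C^{1/2}
    U, s, Vt = np.linalg.svd(W @ P, full_matrices=False)
    B = U[:, :rank] * s[:rank]         # (d_out, rank)
    A = Vt[:rank] @ P_inv              # (rank, d_in), undoes the pre-conditioning
    return B, A

def plain_lowrank(W, rank):
    """Activation-agnostic truncated SVD of W, for comparison."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

def activation_loss(W, W_hat, X):
    """Mean squared output error ||(W - W_hat) X||_F^2 / n."""
    return np.linalg.norm((W - W_hat) @ X) ** 2 / X.shape[1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Anisotropic inputs: some features carry far more energy than others.
    X = rng.normal(size=(16, 256)) * np.linspace(0.1, 5.0, 16)[:, None]
    W = rng.normal(size=(8, 16))
    B, A = activation_aware_lowrank(W, X, rank=4)
    Bp, Ap = plain_lowrank(W, rank=4)
    print("activation-aware loss:", activation_loss(W, B @ A, X))
    print("plain SVD loss:       ", activation_loss(W, Bp @ Ap, X))
```

Under these assumptions the output error equals $\|(W - \hat{W})C^{1/2}\|_F^2$, so truncating the SVD of $WC^{1/2}$ is optimal for that objective by the Eckart-Young theorem, whereas a plain SVD of $W$ ignores which input directions the activations actually use.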

Abstract

Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve the model accuracy over the existing model compression methods when reducing the latent dimension to realize computationally/memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.

Paper Structure

This paper contains 44 sections, 126 equations, 16 figures, 7 tables, 1 algorithm.

Figures (16)

  • Figure 1: Reduced-dimension LLM/LMM with low-rank tensor decomposition. (a) Each linear module is locally compressed by activation-aware tensor decomposition. (b) Multiple linear modules are globally compressed by attention-aware tensor decomposition.
  • Figure 2: Activation-aware compression with pre-conditioning and junction matrix. The junction matrix $J$ can be adjusted such that $A$ or $B$ is block identity, reducing the number of parameters and the inference computation.
  • Figure 3: Tucker decomposition for joint QK compression. The compression matrices $A_\mathrm{q}$ and $A_\mathrm{k}$ correspond to the Tucker tensor planes, while $H=A_\mathrm{q} G A_\mathrm{k}^\top$ is the Tucker tensor core. For simplicity, we omit the junction matrices and the pre-conditioning matrix.
  • Figure 4: Perplexity over compression ratio for OPT models.
  • Figure 5: Perplexity vs. FLOPs of compressed OPT for 125M to 13B models.
  • ...and 11 more figures
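
The junction-matrix idea in the Figure 2 caption can be sketched as follows. This is an illustrative reading rather than the paper's procedure: it assumes a low-rank pair $W \approx BA$ and uses the fact that $BA = (BJ)(J^{-1}A)$ for any invertible $J$. Choosing $J$ as the leading $r \times r$ block of $A$ (assumed invertible here) turns $J^{-1}A$ into $[I \mid \cdot]$, so the identity block needs neither storage nor multiplication. The function names `make_block_identity` and `apply_compressed` are hypothetical.

```python
import numpy as np

def make_block_identity(B, A):
    """Re-parameterize W ~= B @ A so the compression matrix is [I | A_tail].

    B: (d_out, r), A: (r, d_in). Assumes the leading r x r block of A is
    invertible; that block is used as the junction matrix J.
    """
    r = A.shape[0]
    J = A[:, :r]                           # junction matrix (assumption of this sketch)
    A_tail = np.linalg.solve(J, A[:, r:])  # J^{-1} A restricted to the last d_in - r columns
    B_new = B @ J                          # absorb J into the other factor
    return B_new, A_tail

def apply_compressed(B_new, A_tail, x):
    """Compute (B_new @ [I | A_tail]) @ x without materializing the identity.

    x has features on its first axis: shape (d_in,) or (d_in, n).
    """
    r = B_new.shape[1]
    z = x[:r] + A_tail @ x[r:]             # identity block costs only an addition
    return B_new @ z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_out, d_in, r = 8, 16, 4
    B = rng.normal(size=(d_out, r))
    A = rng.normal(size=(r, d_in))
    x = rng.normal(size=d_in)
    B_new, A_tail = make_block_identity(B, A)
    print("max abs difference:", np.max(np.abs(B @ A @ x - apply_compressed(B_new, A_tail, x))))
    print("parameters before:", B.size + A.size, "after:", B_new.size + A_tail.size)
```

In this sketch the stored parameter count drops from $r(d_\mathrm{out} + d_\mathrm{in})$ to $r(d_\mathrm{out} + d_\mathrm{in} - r)$ while the product $BA$ is unchanged up to floating-point error.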

Theorems & Definitions (11)

  • Remark 1
  • Remark 2
  • Remark 3
  • Remark 4
  • Remark 5
  • Remark 6
  • Remark 7
  • Remark 8
  • Remark 9
  • Remark 10
  • ...and 1 more