LatentLLM: Attention-Aware Joint Tensor Compression
Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu Wang, Matthew Brand
TL;DR
The paper tackles the substantial resource requirements of large language and multi-modal models by proposing LatentLLM, a training-free framework that converts pretrained models into a latent, reduced-dimension structure. It does so in two stages: an activation-aware local SVD with pre-conditioning, followed by a global joint tensor decomposition that compresses multiple linear modules together, including the Q/K/V/O and MLP components. Key contributions include the optimal pre-conditioner $P = C^{1/2}$, a junction-matrix trick that reduces the parameter count, and a high-order SVD approach (HOSVD/Tucker) for the joint QK, VO, and UD decompositions. Empirically, LatentLLM and its multi-modal variant LatentLLaVA show significant improvements over local baselines on perplexity benchmarks and multi-modal reasoning tasks, with substantial reductions in FLOPs and parameters, enabling more efficient yet accurate LLMs/LMMs.
Abstract
Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure. Our method extends a local activation-aware tensor decomposition to a global attention-aware joint tensor decomposition. Our framework can significantly improve the model accuracy over existing model compression methods when reducing the latent dimension to realize computationally and memory-efficient LLMs/LMMs. We show the benefit on several benchmarks, including multi-modal reasoning tasks.
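As a rough illustration of the local activation-aware step with the pre-conditioner $P = C^{1/2}$, the following NumPy sketch compresses a single linear layer and compares it against a plain truncated SVD. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes $C$ is the empirical activation correlation matrix $XX^\top/n$, and the helper names (`sqrtm_psd`, `activation_aware_lowrank`, `plain_lowrank`), the toy dimensions, and the random data are all hypothetical. The joint QK/VO/UD decomposition and the junction-matrix trick are not covered here.

```python
import numpy as np

def sqrtm_psd(C, eps=1e-8):
    """Symmetric square root and inverse square root of a PSD matrix."""
    w, V = np.linalg.eigh(C)
    w = np.clip(w, eps, None)            # guard against tiny/negative eigenvalues
    sqrt_C = (V * np.sqrt(w)) @ V.T      # V diag(sqrt(w)) V^T
    inv_sqrt_C = (V / np.sqrt(w)) @ V.T  # V diag(1/sqrt(w)) V^T
    return sqrt_C, inv_sqrt_C

def activation_aware_lowrank(W, X, rank):
    """Rank-`rank` factors B, A with B @ A ~ W, minimizing ||(W - B A) X||_F.

    Assumption: the pre-conditioner is P = C^{1/2} with C = X X^T / n.
    """
    C = X @ X.T / X.shape[1]
    P, P_inv = sqrtm_psd(C)
    # Truncated SVD of the pre-conditioned weight W P.
    U, s, Vt = np.linalg.svd(W @ P, full_matrices=False)
    B = U[:, :rank]                           # (d_out, rank)
    A = (s[:rank, None] * Vt[:rank]) @ P_inv  # (rank, d_in), pre-conditioning undone
    return B, A

def plain_lowrank(W, rank):
    """Baseline: truncated SVD of W that ignores the activations."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank], s[:rank, None] * Vt[:rank]

# Toy data (hypothetical sizes): correlated activations X, random weight W.
rng = np.random.default_rng(0)
d_out, d_in, n, rank = 16, 12, 200, 4
W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, d_in)) @ rng.standard_normal((d_in, n))

B, A = activation_aware_lowrank(W, X, rank)
B0, A0 = plain_lowrank(W, rank)

# Output-space reconstruction error on the activations.
err_aware = np.linalg.norm((W - B @ A) @ X)
err_plain = np.linalg.norm((W - B0 @ A0) @ X)
print(f"activation-aware error: {err_aware:.4f}")
print(f"plain SVD error:        {err_plain:.4f}")
```

Because $\|(W - \hat{W})X\|_F^2 = n\,\|(W - \hat{W})C^{1/2}\|_F^2$, the truncated SVD of $WC^{1/2}$ is the best rank-$r$ choice in this output-space norm (Eckart-Young), so the activation-aware error should never exceed the plain SVD error when $C$ is invertible. The layer's parameter count drops from `d_out * d_in` to `rank * (d_out + d_in)`.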
