Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models
Ryan Solgi, Kai Zhen, Rupak Vignesh Swaminathan, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang
TL;DR
This work tackles post-training compression of large language models by augmenting tensor-train decomposed weights with a learned sparse residual. The proposed Saten framework combines a low-rank TT component with a sparse residual, supporting both unstructured and structured sparsity and allowing fine-tuning directly in the compressed space. The authors provide a complexity analysis and demonstrate state-of-the-art performance on BERT-Base for GLUE and on LLaMA-3.2-1B, achieving meaningful compression with minimal accuracy loss. They also discuss tensor-shape optimization and embedding-layer sparsity, highlighting practical pathways for deploying compressed LLMs on constrained hardware.
Abstract
The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, applying them to compress pre-trained LLMs for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.
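The core idea of augmenting a TT approximation with a sparse residual can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names (`tt_svd`, `tt_to_full`, `top_k_sparse`), the 64x64 random weight, the 8x8x8x8 reshaping, the rank cap of 4, and the magnitude-based choice of residual entries are all assumptions made for illustration. In particular, the residual here is picked by magnitude rather than learned during fine-tuning, and the weight is reshaped directly without the row/column index interleaving commonly used for TT-matrix formats.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Truncated TT-SVD: returns 3-way cores of shape (r_prev, n_k, r_next)."""
    shape = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(r_prev * shape[0], -1)
    for k in range(len(shape) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))  # truncate to the rank cap
        cores.append(u[:, :r].reshape(r_prev, shape[k], r))
        # Carry the remaining factor forward and fold in the next mode.
        mat = (s[:r, None] * vt[:r]).reshape(r * shape[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, shape[-1], 1))
    return cores

def tt_to_full(cores):
    """Contract the TT cores back into a dense tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.squeeze(axis=(0, -1))  # drop the boundary ranks of size 1

def top_k_sparse(residual, k):
    """Keep the k largest-magnitude entries of the residual (unstructured)."""
    flat = residual.ravel()
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    sparse = np.zeros_like(flat)
    sparse[keep] = flat[keep]
    return sparse.reshape(residual.shape)

# Hypothetical dense weight standing in for a pre-trained layer.
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64))

cores = tt_svd(weight.reshape(8, 8, 8, 8), max_rank=4)
low_rank = tt_to_full(cores).reshape(weight.shape)
sparse = top_k_sparse(weight - low_rank, k=200)

err_tt = np.linalg.norm(weight - low_rank) / np.linalg.norm(weight)
err_aug = np.linalg.norm(weight - low_rank - sparse) / np.linalg.norm(weight)
print(f"TT only: {err_tt:.3f}, TT + sparse residual: {err_aug:.3f}")
```

Because the sparse term removes the largest entries of the residual, the augmented approximation error is never worse than the TT-only error. A random Gaussian matrix is close to full rank, which loosely mirrors the high-rank difficulty the abstract mentions, although real pre-trained weights will behave differently.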
