Saten: Sparse Augmented Tensor Networks for Post-Training Compression of Large Language Models

Ryan Solgi, Kai Zhen, Rupak Vignesh Swaminathan, Nathan Susanj, Athanasios Mouchtaris, Siegfried Kunzmann, Zheng Zhang

TL;DR

This work tackles post-training compression of large language models by augmenting tensor-train decomposed weights with a learned sparse residual. The proposed Saten framework combines a low-rank TT component with a sparse residual, supporting both unstructured and structured sparsity and allowing fine-tuning directly in the compressed space. The authors provide a complexity analysis and demonstrate state-of-the-art performance on BERT-Base for GLUE and on LLaMA-3.2-1B, achieving meaningful compression with minimal accuracy loss. They also discuss tensor-shape optimization and embedding-layer sparsity, highlighting practical pathways for deploying compressed LLMs on constrained hardware.
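The sparse plus low-rank idea in this summary can be illustrated with a small NumPy sketch. This is not the authors' implementation: the plain TT-SVD routine, the naive reshape of a 16x16 matrix into a 4x4x4x4 tensor, the rank cap of 2, and the magnitude-based top-k residual (standing in for the learned sparse component) are all assumptions chosen for illustration, and the helper names `tt_svd`, `tt_to_full`, and `top_k_sparse` are hypothetical.

```python
import numpy as np


def tt_svd(tensor, max_rank):
    """Plain TT-SVD: split a d-way tensor into TT cores with ranks capped at max_rank."""
    shape = tensor.shape
    cores, rank = [], 1
    mat = tensor.reshape(rank * shape[0], -1)
    for k in range(len(shape) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        new_rank = min(max_rank, len(s))
        # Core k has shape (previous rank, mode size, next rank).
        cores.append(u[:, :new_rank].reshape(rank, shape[k], new_rank))
        # Carry the truncated remainder forward and fold in the next mode.
        mat = (s[:new_rank, None] * vt[:new_rank]).reshape(new_rank * shape[k + 1], -1)
        rank = new_rank
    cores.append(mat.reshape(rank, shape[-1], 1))
    return cores


def tt_to_full(cores):
    """Contract the TT cores back into the full tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.squeeze(axis=(0, -1))


def top_k_sparse(residual, density):
    """Keep only the largest-magnitude entries (assumed stand-in for a learned sparse residual)."""
    k = max(1, int(density * residual.size))
    threshold = np.partition(np.abs(residual).ravel(), -k)[-k]
    return np.where(np.abs(residual) >= threshold, residual, 0.0)


rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))           # toy "weight matrix" (assumption)
cores = tt_svd(W.reshape(4, 4, 4, 4), max_rank=2)
W_tt = tt_to_full(cores).reshape(W.shape)   # low-rank TT approximation
S = top_k_sparse(W - W_tt, density=0.1)     # unstructured sparse residual
W_hat = W_tt + S                            # sparse-augmented reconstruction

err_tt = np.linalg.norm(W - W_tt)
err_aug = np.linalg.norm(W - W_hat)
tt_params = sum(core.size for core in cores)
print(f"TT params: {tt_params}, sparse nonzeros: {np.count_nonzero(S)}")
print(f"error TT only: {err_tt:.3f}, TT + sparse: {err_aug:.3f}")
```

Because the residual keeps the largest-magnitude entries of `W - W_tt`, adding it back can only lower the Frobenius reconstruction error relative to the TT component alone. In the setting the summary describes, both components would be fine-tuned in compressed form rather than fixed as they are here.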

Abstract

The efficient implementation of large language models (LLMs) is crucial for deployment on resource-constrained devices. Low-rank tensor compression techniques, such as tensor-train (TT) networks, have been widely studied for over-parameterized neural networks. However, applying them to compress pre-trained LLMs for downstream tasks (post-training) remains challenging due to the high-rank nature of pre-trained LLMs and the lack of access to pretraining data. In this study, we investigate low-rank tensorized LLMs during fine-tuning and propose sparse augmented tensor networks (Saten) to enhance their performance. The proposed Saten framework enables full model compression. Experimental results demonstrate that Saten enhances both accuracy and compression efficiency in tensorized language models, achieving state-of-the-art performance.

Paper Structure

This paper contains 27 sections, 13 equations, 2 figures, 6 tables, 2 algorithms.

Figures (2)

  • Figure 1: The sparse + low-rank tensor train representation for a weight matrix (or embedding table).
  • Figure 2: Accuracy versus density of saten(e) for SST2 and MRPC datasets.