TRAWL: Tensor Reduced and Approximated Weights for Large Language Models

Yiran Luo, Het Patel, Yu Fu, Dawon Ahn, Jia Chen, Yue Dong, Evangelos E. Papalexakis

TL;DR

TRAWL tackles the inefficiency of large language models by denoising their weights with tensor decomposition applied across multiple matrices rather than factorizing one matrix at a time. It stacks Q/K/V/O or FC weights into a 3-mode tensor and applies CP or Tucker decomposition, treating the rank $R$ as a hyperparameter and requiring no additional training. Across two models and three benchmark datasets, layer-by-layer CP decomposition of the final FC layers yields the strongest gains, with accuracy improvements of up to 16%, whereas decomposing all layers globally can hurt performance and decomposing segments of layers offers more targeted benefits. The work demonstrates practical post-training compression that reduces noise and improves generalization, and the publicly released code enables further research and real-world application.
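
In notation, and assuming the four attention projections of a layer share the same shape $d \times d$ (an assumption of this summary, not a detail fixed by the paper), the per-layer construction and its rank-$R$ CP approximation are:

$$\mathcal{W}^{(\ell)} = \operatorname{stack}\big(W_Q^{(\ell)}, W_K^{(\ell)}, W_V^{(\ell)}, W_O^{(\ell)}\big) \in \mathbb{R}^{4 \times d \times d}, \qquad \mathcal{W}^{(\ell)} \approx \sum_{r=1}^{R} \lambda_r\, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,$$

where $\circ$ is the vector outer product and the denoised Q/K/V/O weights are read back from the slices of the reconstruction. The Tucker variant replaces the sum of rank-one terms with a small core tensor multiplied along each mode.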

Abstract

Recent research has shown that pruning large-scale language models for inference is an effective approach to improving model efficiency, significantly reducing model weights with minimal impact on performance. Interestingly, pruning can sometimes even enhance accuracy by removing noise that accumulates during training, particularly through matrix decompositions. However, recent work has primarily focused on single matrix decompositions or lower precision techniques, which may fail to fully capture structural patterns. To address these limitations, we introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a technique that applies tensor decomposition across multiple weight matrices to effectively denoise LLMs by capturing global structural patterns. Our experiments show that TRAWL improves model performance by up to 16% over baseline models on benchmark datasets, without requiring additional data, training, or fine-tuning.
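
As a rough illustration of the procedure described above (a minimal sketch using tensorly, not the authors' released implementation; the helper name, stacking axis, and shapes are assumptions):

```python
# Sketch of per-layer denoising: stack one layer's Q/K/V/O projections into a
# 3-mode tensor, compute a rank-R CP approximation, and read the slices back.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def denoise_layer_qkvo(q, k, v, o, rank):
    """q, k, v, o: (d, d) weight matrices from a single attention layer."""
    stacked = np.stack([q, k, v, o], axis=0)        # 3-mode tensor, shape (4, d, d)
    cp = parafac(tl.tensor(stacked), rank=rank)     # rank-R CP decomposition
    approx = tl.cp_to_tensor(cp)                    # low-rank reconstruction
    return [np.asarray(m) for m in approx]          # denoised Q, K, V, O slices

# The rank is treated purely as a hyperparameter; no retraining or fine-tuning follows.
d = 64
q, k, v, o = (np.random.randn(d, d) for _ in range(4))
q_hat, k_hat, v_hat, o_hat = denoise_layer_qkvo(q, k, v, o, rank=16)
```

The same pattern applies to FC weights; only the choice of which matrices are stacked changes.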

Paper Structure

This paper contains 17 sections, 4 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Tensor formulation by stacking QKVO or FC weights from a single layer. A 3-mode tensor is created for each layer, and tensor decomposition is applied.
  • Figure 2: Tensor construction by stacking QKVO or FC weights across all layers. This forms a 3-mode tensor for tensor decomposition.
  • Figure 3: RoBERTa performance when approximating the last few FC layers one at a time: Results for the BiosProfession dataset on the left and the BigBench WikiQA dataset on the right. The blue dashed line represents the best LASER result, while the red dashed line indicates the baseline model performance without any decomposition. Decomposing the last few layers individually led to the highest performance gains.
  • Figure 4: Performance of CP-decomposed FC weights across the GPT-J model on BigBench WikiQA. Combining all layers into one decomposition significantly reduces accuracy, likely due to layer heterogeneity. By contrast, a layer-by-layer approach preserves or improves performance (the two constructions are contrasted in the sketch after this list).
  • Figure 5: GPT-J performance when approximating the last few FC layers one at a time: Results for the BiosProfession dataset on the left and the BigBench WikiQA dataset on the right. The blue dashed line represents the best LASER result, while the red dashed line indicates the baseline model performance without any decomposition. Decomposing the last few layers individually led to the highest performance gains.
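
The difference between the single-layer and all-layers constructions in Figures 1, 2, and 4 can be made concrete with a short sketch (shapes, stacking axes, and function names here are illustrative assumptions, not the paper's code):

```python
# Contrast of the two tensor constructions: per-layer stacking builds one small
# 3-mode tensor per layer and decomposes each independently, while global stacking
# builds a single 3-mode tensor across all layers and couples them through shared factors.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def per_layer_tensor(layer):
    """Figure 1: stack one layer's Q/K/V/O projections -> shape (4, d, d)."""
    return np.stack([layer["q"], layer["k"], layer["v"], layer["o"]], axis=0)

def across_layers_tensor(layers, key="o"):
    """Figure 2: stack the same projection from every layer -> shape (L, d, d)."""
    return np.stack([layer[key] for layer in layers], axis=0)

d, num_layers, rank = 64, 6, 16
layers = [{k: np.random.randn(d, d) for k in "qkvo"} for _ in range(num_layers)]
per_layer_cps = [parafac(tl.tensor(per_layer_tensor(l)), rank=rank) for l in layers]
global_cp = parafac(tl.tensor(across_layers_tensor(layers)), rank=rank)
```

Forcing every layer to share one set of factors in the global decomposition is one plausible reading of why layer heterogeneity hurts accuracy in Figure 4.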