SoftLMs: Efficient Adaptive Low-Rank Approximation of Language Models using Soft-Thresholding Mechanism

Priyansh Bhatnagar, Linfeng Wen, Mingu Kang

TL;DR

This paper proposes a compression method that dynamically determines the rank of each layer using a soft-thresholding mechanism. The mechanism clips small-magnitude singular values in a differentiable form to identify the optimal degree of compression for each layer.

Abstract

Extensive efforts have been made to boost the performance of language models by introducing various attention-based transformers. However, their large linear layers incur significant computational and memory overheads. The escalating computational demands of these models call for compression techniques that enable deployment on devices, particularly in resource-constrained environments. In this paper, we propose a novel compression methodology that dynamically determines the rank of each layer using a soft-thresholding mechanism, which clips small-magnitude singular values in a differentiable form. This approach automates the choice of the optimal degree of compression for each layer. We have successfully applied the proposed technique to attention-based architectures, including BERT for discriminative tasks and GPT2 and TinyLlama for generative tasks. Additionally, we have validated our method on Mamba, a recently proposed state-space model. Our experiments demonstrate that the proposed technique achieves a speed-up of 1.33× to 1.72× in the encoder/decoder with a 50% reduction in total parameters.
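
The idea of clipping small singular values in a differentiable form can be illustrated with a minimal sketch. This is not the paper's exact formulation: the gate `0.5 * (1 + tanh(s * (sigma - alpha)))`, the function names, the cutoff `eps`, and the sample singular values are all assumptions chosen for illustration. Only the ingredients (a threshold $\alpha$ and a shifted Tanh with sharpness factor $s$) come from the source.

```python
import math


def soft_gate(sigma, alpha, s):
    """Smooth 0-to-1 gate centered at alpha (assumed form, not the paper's).

    A larger sharpness s makes the gate closer to a hard step at alpha.
    """
    return 0.5 * (1.0 + math.tanh(s * (sigma - alpha)))


def soft_threshold(singular_values, alpha, s=10.0):
    """Scale each singular value by its gate: values well below alpha
    shrink toward zero, values well above alpha pass almost unchanged."""
    return [sv * soft_gate(sv, alpha, s) for sv in singular_values]


def effective_rank(singular_values, alpha, s=10.0, eps=1e-3):
    """Count gated singular values that remain above a small cutoff eps
    (eps is a hypothetical pruning tolerance)."""
    gated = soft_threshold(singular_values, alpha, s)
    return sum(1 for g in gated if g > eps)


# Hypothetical singular values of one layer, sorted in descending order.
singular_values = [5.0, 3.0, 1.2, 0.4, 0.1, 0.02]

gated = soft_threshold(singular_values, alpha=1.0)
rank_low_alpha = effective_rank(singular_values, alpha=1.0)
rank_high_alpha = effective_rank(singular_values, alpha=4.0)

print(gated)
print(rank_low_alpha, rank_high_alpha)
```

Because the gate is smooth in `alpha`, a gradient-based optimizer could in principle adjust the threshold per layer during fine-tuning; raising `alpha` lowers the number of singular values that survive, which is what allows the rank to differ from layer to layer.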

Paper Structure

This paper contains 27 sections, 11 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Accuracy trend of low-rank approximation for BERT-Base on the GLUE MRPC dataset. (a) Accuracy drop when each BERT block is compressed individually to remove 50% of its parameters, indicating that blocks contribute unequally to overall task performance. (b) Rank of the $W_{Q}$ layer in each block obtained from static vs. proposed adaptive rank decomposition after fine-tuning, with both set to a 50% overall parameter reduction. The adaptive rank decomposition (F1: 90.7) gains 4.3 F1 points over the static method (F1: 86.4).
  • Figure 2: SoftBERT encoder with $N$ blocks, where $W_{Q}$, $W_{K}$, $W_{V}$, $W_{proj}$ in Multi-Head Attention and $W_{fc1}$ and $W_{fc2}$ in the Feed Forward Network are substituted with U, S and V. The module S employs the Soft Threshold function to clamp the singular values in $\Sigma$, so that the rank of each block adapts during fine-tuning.
  • Figure 3: Learnable threshold $(\alpha)$ for selecting top singular values in $\Sigma$; values below $(\alpha)$ are set to zero.
  • Figure 4: Threshold functions. (a) Conventional non-differentiable thresholding, (b) shifted Tanh with sharpness control factor $s$, and (c) differentiable soft thresholding combining the above functions.
  • Figure 5: The magnitude of compression loss over the threshold $\alpha$ with a conventional linear loss ($\mathcal{L}_{cmp}$) and the proposed adaptive loss ($\mathcal{L}_{acmp}$).
  • ...and 9 more figures
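
As a rough worked example of the 50% budget mentioned in Figure 1, consider the following sketch. It rests on assumptions that the listing above does not state: a weight matrix $W \in \mathbb{R}^{m \times n}$ is replaced by two factors of rank $r$, and the $r$ extra parameters of the diagonal singular-value matrix are ignored. Under those assumptions the parameter count changes as:

$$
mn \;\longrightarrow\; r(m+n), \qquad r(m+n) \le \tfrac{1}{2}\,mn \iff r \le \frac{mn}{2(m+n)}.
$$

For a square $768 \times 768$ matrix such as $W_{Q}$ in BERT-Base, this gives $r \le 768^{2} / (2 \cdot 1536) = 192$. A static scheme would presumably apply one such rank to every block, whereas an adaptive scheme can assign more rank to sensitive blocks and less to others while meeting the same overall budget.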