FineGates: LLMs Finetuning with Compression using Stochastic Gates

Jonathan Svirsky, Yehonathan Refael, Ofir Lindenbaum

TL;DR

This work tackles the resource-intensive problem of finetuning large language models with limited data by introducing FineGates, a gating-based adaptor that sparsifies the frozen base weights while enabling task-specific adaptation. The method formulates a structured sparsity objective using learnable gates $\omega_r$ and $\omega_c$ applied to base weight matrices, optionally augmented by low-rank adapters $W_BW_A$, and trained with a Gaussian-relaxation of the $\ell_0$ penalty to encourage sparsity. Empirically, FineGates achieves competitive or superior accuracy to LoRA on GLUE benchmarks under data-limited settings, while reducing trainable parameters to roughly $0.14\%$ of the base model and compressing the base weights by $10$–$20\%$, yielding practical speedups. A convergence proof shows the relaxed objective has a Lipschitz-continuous gradient, supporting SGD optimization to a stationary point. The approach demonstrates meaningful speedups and model compression without substantial accuracy loss, with potential for further pruning of attention components and multi-task extensions.

Abstract

Large Language Models (LLMs), with billions of parameters, present significant challenges for full finetuning due to high computational demands, memory requirements, and impracticality in many real-world applications. When computational resources are limited or datasets are small, updating all model parameters can often result in overfitting. To address this, lightweight finetuning techniques have been proposed, such as learning low-rank adapter layers. These methods train only a few additional parameters combined with the base model, which remains frozen, reducing resource usage and mitigating overfitting risks. In this work, we propose an adaptor model based on stochastic gates that simultaneously sparsifies the frozen base model and adapts it to the task. Our method has a small number of trainable parameters and allows us to speed up base-model inference while maintaining competitive accuracy. We also evaluate variants that equip it with additional low-rank parameters and compare it to several recent baselines. Our results show that the proposed method improves finetuned model accuracy relative to several baselines and allows the removal of up to 20-40% of the model parameters without significant accuracy loss.
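To make the stochastic-gate idea concrete, here is a minimal sketch of a Gaussian-relaxed gate in the style commonly used for stochastic gates. It is an illustration under stated assumptions, not the authors' implementation: the clipping form of the gate, the noise scale `SIGMA`, and the function names are all hypothetical choices made here.

```python
import math
import random

SIGMA = 0.5  # assumed noise scale (hypothetical value, not taken from the paper)


def gate_sample(mu, sigma=SIGMA, rng=random):
    """Training-time gates: add Gaussian noise to each mean, then clip to [0, 1]."""
    return [min(1.0, max(0.0, m + rng.gauss(0.0, sigma))) for m in mu]


def gate_eval(mu):
    """Inference-time gates: drop the noise and clip the means to [0, 1]."""
    return [min(1.0, max(0.0, m)) for m in mu]


def expected_open_gates(mu, sigma=SIGMA):
    """Relaxed l0 penalty: sum_d P(mu_d + eps_d > 0) = sum_d Phi(mu_d / sigma),
    where Phi is the standard normal CDF, written here via math.erf."""
    return sum(0.5 * (1.0 + math.erf(m / (sigma * math.sqrt(2.0)))) for m in mu)


# Example: three gate means; the first is fully closed at inference time.
mu = [-0.3, 0.4, 1.7]
deterministic = gate_eval(mu)      # [0.0, 0.4, 1.0]
noisy = gate_sample(mu)            # random values, each in [0, 1]
penalty = expected_open_gates(mu)  # smooth surrogate for the number of open gates
```

Under these assumptions the penalty is differentiable in the gate means, which is what lets a sparsity term be trained with gradient-based optimizers alongside the task loss.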

Paper Structure

This paper contains 17 sections, 1 theorem, 17 equations, 3 figures, 5 tables.

Key Result

Proposition 1

Suppose $f\equiv\mathcal{L}_{\text{task}}$ is an $L$-smooth non-convex function that is bounded by $M$. Then minimizing the whole objective $\mathcal{L}$ (FineGates) using vanilla SGD is guaranteed to converge to a stationary point.
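
For reference, convergence to a stationary point for SGD on a smooth non-convex objective is usually expressed as a bound on the average squared gradient norm. The inequality below is the standard textbook form of such a guarantee, not the bound derived in the paper's proof. The iterates $\theta_t$, step size $\eta$, horizon $T$, gradient-noise variance $\sigma^2$, smoothness constant $\tilde{L}$ of the full objective $\mathcal{L}$, and lower bound $\mathcal{L}^\star$ are notation assumed here for illustration. With unbiased stochastic gradients of variance at most $\sigma^2$ and a constant step size $\eta \le 1/\tilde{L}$:

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\,\|\nabla\mathcal{L}(\theta_t)\|^2 \;\le\; \frac{2\left(\mathcal{L}(\theta_0)-\mathcal{L}^\star\right)}{\eta T} + \eta\,\tilde{L}\,\sigma^2 .$$

Choosing $\eta \propto 1/\sqrt{T}$ makes the right-hand side decay as $O(1/\sqrt{T})$, so the expected gradient norm of at least one iterate becomes arbitrarily small as $T$ grows.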

Figures (3)

  • Figure 1: Two versions of our method: (a) In the first one, we train an adaptor with additional weights $\mathbf{W}_A, \mathbf{W}_B$. After training, we compute the updated and pruned weight matrix $\tilde{\mathbf{W}} = \boldsymbol{\omega}_r \cdot (\mathbf{W}_0 + \mathbf{W}_B\mathbf{W}_A)\cdot \boldsymbol{\omega}_c$. (b) In the simplified version, the adaptor is based only on the trainable gate vectors $\boldsymbol{\omega}_r, \boldsymbol{\omega}_c$ that enforce structured sparsity.
  • Figure 2: Sparsification-accuracy trade-off measured on the CoLA, SST2, and STSB datasets. On SST2, our model provides $>\mathbf{40\%}$ structured sparsity while sacrificing only $\mathbf{4\%}$ of accuracy compared to the model without sparsification; here we train $\boldsymbol{\omega}_r, \boldsymbol{\omega}_c$ with a total of $166$K parameters. On CoLA, the method removes up to $20\%$ of parameters without significant loss in accuracy, and on STSB it removes $40\%$ with only a $3\%$ drop in accuracy.
  • Figure 3: (a) Relative time reduction of the gated multiplication $(\mathbf{W}^T \cdot \boldsymbol{\omega}) (\mathbf{X} \cdot \boldsymbol{\omega})$ compared to the full matrix multiplication $\mathbf{W}^T\mathbf{X}$. We measure CPU time by repeating the operation 100K times and report the average time (vertical axis) for each sparsity level (horizontal axis). (b) Inference time for a single validation epoch at varying sparsity levels.
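
The pruning step in Figure 1 and the speedup measured in Figure 3 can be illustrated with a small NumPy sketch. It assumes that the products with the gate vectors act as row and column scaling, and that the gates are exactly binary after training; the dimensions, random weights, and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2  # hypothetical layer sizes

W0 = rng.standard_normal((d_out, d_in))   # frozen base weight
W_B = rng.standard_normal((d_out, rank))  # low-rank adapter factors
W_A = rng.standard_normal((rank, d_in))

# Assumed binary gates after training: 1 keeps a row/column, 0 removes it.
omega_r = np.array([1, 0, 1, 1, 0, 1], dtype=float)
omega_c = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=float)

# Gated matrix: rows scaled by omega_r, columns scaled by omega_c.
W_tilde = omega_r[:, None] * (W0 + W_B @ W_A) * omega_c[None, :]

# Structured sparsity: physically drop the gated-out rows and columns.
keep_r = omega_r > 0
keep_c = omega_c > 0
W_small = W_tilde[np.ix_(keep_r, keep_c)]

x = rng.standard_normal(d_in)
y_full = W_tilde @ x            # full-size product with zeroed rows/columns
y_small = W_small @ x[keep_c]   # smaller product on the kept coordinates only

# Fraction of weight entries removed by dropping rows and columns.
removed_fraction = 1.0 - W_small.size / W_tilde.size
```

Under these assumptions the compact product reproduces every nonzero output of the gated layer while multiplying a smaller matrix, which is the source of the time reduction that structured (rather than unstructured) sparsity can offer.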

Theorems & Definitions (2)

  • Proposition 1: Convergence of FineGates
  • proof