Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity

Jonathan Svirsky; Yehonathan Refael; Ofir Lindenbaum

TL;DR

The paper tackles the high cost of finetuning large language models by introducing FineGates, a gating-based approach that learns binary row/column masks to sparsify base weights during task adaptation. By updating only lightweight gate parameters and leveraging a Gaussian relaxation plus kurtosis-informed sparsity, the method removes up to $40\%$ of parameters and achieves meaningful inference speedups while maintaining accuracy. It comes with theoretical convergence guarantees and, under the Polyak--Łojasiewicz (PL) condition, an optimization landscape that is better conditioned than LoRA's. Empirically, FineGates matches or surpasses strong baselines on GLUE with RoBERTa backbones and often exceeds LoRA on Llama3.2-1B, while enabling pruning during pretraining and CPU-friendly inference, highlighting structured sparsity as a practical mechanism for scalable LM adaptation.

Abstract

Fully finetuning foundation language models (LMs) with billions of parameters is often impractical due to high computational costs, memory requirements, and the risk of overfitting. Although methods like low-rank adapters help address these challenges by adding small trainable modules to the frozen LM, they also increase memory usage and do not reduce inference latency. We uncover an intriguing phenomenon: sparsifying specific model rows and columns enables efficient task adaptation without requiring weight tuning. We propose a scheme for effective finetuning via sparsification by training stochastic gates, which requires minimal trainable parameters, reduces inference time, and removes 20--40\% of model parameters without significant accuracy loss. Empirical results show that it outperforms recent finetuning baselines in both efficiency and performance. Additionally, we provide theoretical guarantees for the convergence of this stochastic gating process, and show that our method admits a simpler and better-conditioned optimization landscape compared to LoRA. Our results highlight sparsity as a compelling mechanism for task-specific adaptation in LMs.
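
To make the row/column gating idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the gates follow the common clipped-Gaussian relaxation $z = \mathrm{clip}(\mu + \epsilon, 0, 1)$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$; the function names `sample_gates` and `gated_linear`, the value of `sigma`, and the toy shapes are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_gates(mu, sigma, rng, training=True):
    """Relaxed gates (assumed form): z = clip(mu + eps, 0, 1), eps ~ N(0, sigma^2).
    With training=False the noise is dropped and the gates are deterministic."""
    eps = rng.normal(0.0, sigma, size=mu.shape) if training else 0.0
    return np.clip(mu + eps, 0.0, 1.0)

def gated_linear(x, W0, b, z_r, z_c):
    """Linear layer with frozen weights W0 (m x n), scaled by a row gate
    z_r (length m) and a column gate z_c (length n):
    y = x @ (diag(z_r) @ W0 @ diag(z_c)).T + b."""
    W = z_r[:, None] * W0 * z_c[None, :]
    return x @ W.T + b

rng = np.random.default_rng(0)
m, n = 4, 6
W0 = rng.normal(size=(m, n))   # frozen base weights, never updated
b = np.zeros(m)

# Hypothetical gate means; in training these would be the learned parameters.
mu_r = np.array([1.0, 0.0, 1.0, 1.0])
mu_c = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])

x = rng.normal(size=(3, n))
z_r = sample_gates(mu_r, 0.5, rng, training=False)
z_c = sample_gates(mu_c, 0.5, rng, training=False)
y = gated_linear(x, W0, b, z_r, z_c)

# Gate parameters scale as m + n per layer, versus m * n for the weights.
print("gate parameters:", mu_r.size + mu_c.size, "weight parameters:", W0.size)
```

In this sketch the second output unit is switched off by its row gate, so the corresponding column of `y` equals the bias; the number of gate parameters grows with $m + n$ rather than $m \times n$.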

Paper Structure (44 sections, 6 theorems, 35 equations, 9 figures, 13 tables)


Introduction
Related Work
Low-Rank Adaptation
Memory efficient optimizers.
Finetuning with Pruning
Finetuning with Adaptive Pruning
Problem Formulation
Finetuning by sparsification.
The Method
Theoretical Analysis
FineGates vs LoRA Landscape
FineGates Convergence
Experiments
Experimental Setup and Datasets
Baselines
...and 29 more sections

Key Result

Proposition 5.2

Let $\mathbf{W}_0\in\mathbb{R}^{m\times n}$ be the weight matrix of a single linear layer trained with a smooth loss function $\mathcal{L}(\mathbf{W})$ that satisfies the PL condition. Let

Figures (9)

Figure 1: CPU inference time reduction (%) and number of removed parameters on the MRPC validation set while finetuning our method on the Llama3.2-1B backbone. See Section \ref{sec:inference_time} for details.
Figure 2: Overview of FineGates: Our method introduces structured sparsity in LM finetuning by training lightweight row and column gating vectors ($\bm{\omega}_c, \bm{\omega}_r$). These gates selectively retain the most informative weight dimensions, enabling efficient adaptation without modifying the base model’s parameters. Unlike LoRA and other PEFT methods, which introduce additional trainable matrices, FineGates directly optimizes sparsification and updates biases, thereby reducing memory overhead and inference time while maintaining task performance.
Figure 3: Validation accuracy trajectories of FineGates and LoRA on the MRPC dataset.
Figure 4: Sparsification-accuracy trade-off measured on the CoLA, SST2 (full), STSB, and MRPC datasets with RoBERTa-Base, RoBERTa-Large, and Llama3.2-1B backbones. Our model achieves $>\mathbf{40\%}$ structured sparsity while sacrificing only $\mathbf{4\%}$ of accuracy compared to the model without sparsification on the SST2 dataset, where we train $\bm{\omega}_r, \bm{\omega}_c$ with a total of $166K$ parameters. On CoLA, the method removes up to $20\%$ of parameters without significant loss in accuracy, and $40\%$ on the STSB dataset with only a $3\%$ drop in accuracy. The method removes up to $\sim 470M$ parameters from the Llama3.2-1B base model with only $6\%$ accuracy loss.
Figure 5: CPU Inference time measurements along with validation accuracy averaged across 3 seeds for a single validation epoch of MRPC dataset.
...and 4 more figures
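
The captions above link structured sparsity to removed parameters and faster CPU inference. As a rough illustration of why closed row and column gates translate into a physically smaller layer, here is a minimal NumPy sketch under stated assumptions: the helper `prune_gated_layer` is hypothetical, the bias is omitted for brevity, and a gate counts as closed only when it is exactly zero.

```python
import numpy as np

def prune_gated_layer(W0, z_r, z_c):
    """Drop rows/columns whose gate is exactly zero and fold the surviving
    gate values into a smaller dense matrix (hypothetical helper)."""
    keep_r = z_r > 0
    keep_c = z_c > 0
    W_small = (z_r[keep_r][:, None]
               * W0[np.ix_(keep_r, keep_c)]
               * z_c[keep_c][None, :])
    return W_small, keep_r, keep_c

rng = np.random.default_rng(0)
W0 = rng.normal(size=(4, 6))                        # toy frozen weights
z_r = np.array([1.0, 0.0, 1.0, 1.0])                # one closed row gate
z_c = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 1.0])      # two closed column gates
x = rng.normal(size=(3, 6))

# Dense forward pass with the gates applied as masks.
y_dense = x @ (z_r[:, None] * W0 * z_c[None, :]).T

# Pruned forward pass: smaller matrix, fewer input features read.
W_small, keep_r, keep_c = prune_gated_layer(W0, z_r, z_c)
y_small = x[:, keep_c] @ W_small.T

removed_fraction = 1.0 - W_small.size / W0.size
print("pruned shape:", W_small.shape, "removed fraction:", removed_fraction)
```

The pruned layer reproduces the surviving outputs of the masked dense layer while storing and multiplying a smaller matrix, which is what makes structured (row/column) sparsity usable on ordinary dense hardware without sparse kernels.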

Theorems & Definitions (11)

Definition 5.1: Polyak--Łojasiewicz (PL) Condition
Proposition 5.2: FineGates vs LoRA optimization landscape
Definition 5.3: $L$-continuity
Lemma 5.4: Convergence of FineGates
Lemma B.1: Chain-rule lower bound via Jacobian
Lemma B.2: Uniform lower bound on the Jacobian for gates
proof
Proposition B.3: PL is preserved under two-sided gates
Proposition B.4: Counterexample for LoRA
proof
...and 1 more
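
For readers unfamiliar with the PL condition named in Definition 5.1, its standard textbook form is recalled here; the paper's exact statement and constants may differ. A differentiable function $f$ with minimum value $f^*$ satisfies the PL condition with constant $\mu > 0$ if, for all $x$,

$$\tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^*\bigr).$$

If $f$ additionally has an $L$-Lipschitz gradient, gradient descent with step size $1/L$ converges linearly even without convexity:

$$f(x_k) - f^* \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^*\bigr).$$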
