Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity
Jonathan Svirsky, Yehonathan Refael, Ofir Lindenbaum
TL;DR
The paper tackles the high cost of finetuning large language models by introducing FineGates, a gating-based approach that learns binary row and column masks to sparsify the base weights during task adaptation. Only the lightweight gate parameters are updated, using a Gaussian relaxation of the binary gates together with kurtosis-informed sparsity. The method removes up to $40\%$ of parameters and achieves meaningful inference speedups while maintaining accuracy. It comes with theoretical convergence guarantees, and its optimization landscape, analyzed via a Polyak-Łojasiewicz (PL) condition, is better conditioned than that of LoRA. Empirically, FineGates matches or surpasses strong baselines on GLUE with RoBERTa backbones and often exceeds LoRA on Llama3.2-1B, while also enabling pruning during pretraining and CPU-friendly inference. These results highlight structured sparsity as a practical mechanism for scalable LM adaptation.
Abstract
Fully finetuning foundation language models (LMs) with billions of parameters is often impractical due to high computational costs, memory requirements, and the risk of overfitting. Although methods like low-rank adapters help address these challenges by adding small trainable modules to the frozen LM, they also increase memory usage and do not reduce inference latency. We uncover an intriguing phenomenon: sparsifying specific model rows and columns enables efficient task adaptation without requiring weight tuning. We propose a scheme for effective finetuning via sparsification by training stochastic gates, which requires minimal trainable parameters, reduces inference time, and removes 20--40\% of model parameters without significant accuracy loss. Empirical results show that it outperforms recent finetuning baselines in both efficiency and performance. Additionally, we provide theoretical guarantees for the convergence of this stochastic gating process, and show that our method admits a simpler and better-conditioned optimization landscape compared to LoRA. Our results highlight sparsity as a compelling mechanism for task-specific adaptation in LMs.
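To make the row and column gating idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a clipped-Gaussian gate of the form $z = \mathrm{clip}(\mu + \sigma\epsilon, 0, 1)$ with $\epsilon \sim \mathcal{N}(0, 1)$, applied multiplicatively to the rows and columns of a frozen weight matrix. The function names (`sample_gates`, `hard_gates`, `prune`), the value of `sigma`, and the gate means are all hypothetical and chosen only for illustration; the paper's exact parameterization and kurtosis-based regularizer are not reproduced here.

```python
import random

def sample_gates(mu, sigma=0.5, rng=random):
    """Training-time gates (assumed form): clip(mu + sigma * eps, 0, 1), eps ~ N(0, 1)."""
    return [min(1.0, max(0.0, m + sigma * rng.gauss(0.0, 1.0))) for m in mu]

def hard_gates(mu):
    """Inference-time gates (assumed form): the noise is dropped, leaving clip(mu, 0, 1)."""
    return [min(1.0, max(0.0, m)) for m in mu]

def prune(W, row_gates, col_gates):
    """Apply the gates to the frozen matrix W and drop rows/columns whose gate is zero."""
    keep_r = [i for i, g in enumerate(row_gates) if g > 0.0]
    keep_c = [j for j, g in enumerate(col_gates) if g > 0.0]
    return [[row_gates[i] * col_gates[j] * W[i][j] for j in keep_c] for i in keep_r]

# Frozen base weights; only the gate means below would be trained.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
mu_rows = [1.2, -0.3]        # hypothetical learned means: second row is closed
mu_cols = [0.8, -0.1, 1.5]   # hypothetical learned means: middle column is closed

z_rows = hard_gates(mu_rows)
z_cols = hard_gates(mu_cols)
W_small = prune(W, z_rows, z_cols)   # 1 x 2 matrix instead of 2 x 3

noisy = sample_gates(mu_rows)        # stochastic gates, as would be used during training
```

In this toy example the trainable state is five gate means rather than six weights; for realistic layer sizes the gap is far larger, since the number of gates grows with the sum of the dimensions while the number of weights grows with their product. In a real network, removing a row or column also requires shrinking the matching dimension of the adjacent layer, which this sketch does not show.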
