Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes

Mohammadsajad Alipour, Mohammad Mohammadi Amiri

TL;DR

This work targets efficient storage of fine-tuning updates $\Delta W^l$ for large language models under a fixed memory budget. It introduces Optimal Singular Damage (OSD), a two-phase approach that first relaxes the low-rank approximation by allowing rank $k=r+c$ and then applies coordinated, importance-aware sparsification across the factor matrices $U_k$ and $V_k$ using a Taylor-based importance score $Z^l$; the total per-layer memory is $\mu^l = 32(s_u+s_v) + \lceil \log_2(n(r+c)) \rceil s_u + \lceil \log_2(d(r+c)) \rceil s_v$. OSD jointly optimizes rank relaxation and sparsity to maximize task utility within the budget, deriving a principled memory analysis and a single-pass sparsification algorithm that retains the most impactful parameters. Empirically, OSD consistently outperforms plain TruncSVD and magnitude-based MagTruncSVD, especially in extreme low-storage regimes, and adds no inference-time latency. The approach enables scalable deployment of personalized or task-specific models under bandwidth and memory constraints, in settings such as federated learning, edge deployment, and model hubs. Future directions include adaptive rank selection and extending the framework to broader parameter-efficient fine-tuning (PEFT) scenarios.
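The per-layer memory expression above can be evaluated directly. The sketch below is illustrative rather than the authors' code: it assumes that $s_u$ and $s_v$ are the numbers of entries kept in $U_k$ and $V_k$, that each kept value costs 32 bits, and that each kept entry also stores one flat index into its $n \times (r+c)$ or $d \times (r+c)$ factor. The dense rank-$r$ baseline cost of $32\,r\,(n+d)$ bits and both function names are assumptions introduced here for comparison.

```python
def ceil_log2(x: int) -> int:
    """Exact ceil(log2(x)) for a positive integer x, without floating point."""
    return (x - 1).bit_length()


def sparse_factor_bits(n: int, d: int, r: int, c: int, s_u: int, s_v: int) -> int:
    """Per-layer memory mu^l in bits for sparsified rank-(r+c) factors.

    Mirrors mu^l = 32(s_u + s_v) + ceil(log2(n(r+c))) s_u + ceil(log2(d(r+c))) s_v.
    """
    k = r + c
    value_bits = 32 * (s_u + s_v)          # 32 bits per kept value
    index_bits_u = ceil_log2(n * k) * s_u  # one flat index per kept entry of U_k
    index_bits_v = ceil_log2(d * k) * s_v  # one flat index per kept entry of V_k
    return value_bits + index_bits_u + index_bits_v


def dense_rank_r_bits(n: int, d: int, r: int) -> int:
    """Assumed baseline: dense n x r and d x r factors at 32 bits per value."""
    return 32 * r * (n + d)


# Hypothetical layer: 1024 x 1024 update, baseline rank r = 2, relaxed by c = 2.
budget = dense_rank_r_bits(1024, 1024, r=2)
used = sparse_factor_bits(1024, 1024, r=2, c=2, s_u=1000, s_v=1000)
fits = used <= budget
```

Under these assumptions each kept entry costs 32 value bits plus its index bits, so the relaxed rank $r+c$ is affordable only when $s_u$ and $s_v$ are small enough to keep `used` within `budget`.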

Abstract

Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. Yet even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recent studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be exploited for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that, given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, which selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing low-rank approximation or sparsification individually.
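The observation about sparsified larger-rank approximations can be illustrated with a magnitude-based variant, in the spirit of the MagTruncSVD baseline. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, splitting the singular values as square roots across both factors is an assumption, and entries are kept by magnitude alone, whereas OSD selects them with its importance score $Z^l$, which is not reproduced here.

```python
import numpy as np


def keep_top_entries(mat: np.ndarray, s: int) -> np.ndarray:
    """Return a copy of mat with all but its s largest-magnitude entries zeroed."""
    out = np.zeros_like(mat)
    if s <= 0:
        return out
    if s >= mat.size:
        return mat.copy()
    flat_abs = np.abs(mat).ravel()
    keep = np.argpartition(flat_abs, -s)[-s:]  # flat indices of the s largest magnitudes
    out.flat[keep] = mat.flat[keep]
    return out


def sparsified_truncated_svd(delta_w, r, c, s_u, s_v):
    """Rank-(r+c) truncated SVD of delta_w followed by magnitude sparsification.

    Returns sparse factors (u_k, v_k) such that u_k @ v_k.T approximates delta_w.
    """
    k = r + c
    u, sigma, vt = np.linalg.svd(delta_w, full_matrices=False)
    root = np.sqrt(sigma[:k])      # assumption: split singular values across factors
    u_k = u[:, :k] * root          # shape (n, k)
    v_k = vt[:k, :].T * root       # shape (d, k)
    return keep_top_entries(u_k, s_u), keep_top_entries(v_k, s_v)


# Hypothetical update matrix; a real one would be W_finetuned - W_pretrained for a layer.
rng = np.random.default_rng(0)
delta_w = rng.standard_normal((16, 12))
u_sparse, v_sparse = sparsified_truncated_svd(delta_w, r=2, c=2, s_u=20, s_v=15)
approx = u_sparse @ v_sparse.T
```

Choosing `s_u` and `s_v` so that the sparse factors cost no more than a dense rank-`r` approximation gives the equal-budget comparison the abstract describes.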

Paper Structure

This paper contains 20 sections, 10 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: The average accuracy of the models RobertaLarge and OPT-1.3b on eight datasets in different natural language processing tasks including sentiment analysis, text classification, paraphrase detection and natural language inference, when we perform TruncSVD with rank $k=r$ and the proposed integrated approach (MagTruncSVD) combining TruncSVD with rank $k=r+c$ and sparsification for the same storage budget as TruncSVD with $k=r$.
  • Figure 2: Effect of incorporating weight importance $Z^l$ into OSD and excluding it, on the model performance after approximation, for RobertaLarge and OPT-1.3b models.
  • Figure 3: Average performance across all tasks and all values of $r$ ($1 \leq r \leq 4$) for different values of $c$ in the OSD algorithm, evaluated on the RobertaLarge and OPT-1.3b models.