Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes
Mohammadsajad Alipour, Mohammad Mohammadi Amiri
TL;DR
This work targets efficient storage of fine-tuning updates ${\Delta W}^l$ for large language models under a fixed memory budget. It introduces Optimal Singular Damage (OSD), a two-phase approach that first relaxes the low-rank approximation by allowing rank $k=r+c$ and then applies coordinated, importance-aware sparsification across the factor matrices ${U_k}$ and ${V_k}$ using a Taylor-based importance score ${Z^l}$. The total per-layer memory is ${\mu}^l = 32(s_u+s_v) + \lceil \log_2(n(r+c)) \rceil s_u + \lceil \log_2(d(r+c)) \rceil s_v$. OSD jointly optimizes rank relaxation and sparsity to maximize task utility within the budget, deriving a principled memory analysis and a single-pass sparsification algorithm that retains the most impactful parameters. Empirically, OSD consistently outperforms pure truncSVD and magnitude-based MagTruncSVD, especially in extreme low-storage regimes, and incurs no increase in inference-time latency. The approach enables scalable deployment of personalized or task-specific models under bandwidth and memory constraints, in settings such as federated learning, edge deployment, and model hubs. Future directions include adaptive rank selection and extending the framework to broader parameter-efficient fine-tuning (PEFT) scenarios.
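To make the per-layer memory expression concrete, the following is a minimal sketch that evaluates ${\mu}^l$ for given sizes. It rests on a reading of the formula that the summary does not spell out: $s_u$ and $s_v$ are taken to be the numbers of retained non-zero entries in ${U_k}$ (assumed $n \times (r+c)$) and ${V_k}$ (assumed $d \times (r+c)$), each stored as a 32-bit value plus a flat index, with the result in bits. The function names and the dense rank-$r$ baseline are illustrative, not taken from the paper.

```python
def ceil_log2(m: int) -> int:
    """Exact integer ceil(log2(m)) for m >= 1, avoiding floating-point rounding."""
    return (m - 1).bit_length()


def osd_layer_memory_bits(n, d, r, c, s_u, s_v, value_bits=32):
    """Evaluate mu^l = 32(s_u+s_v) + ceil(log2(n(r+c))) s_u + ceil(log2(d(r+c))) s_v.

    Assumption: s_u and s_v count the retained non-zeros in U_k and V_k,
    and each retained entry costs one value plus one flat index.
    """
    k = r + c
    index_bits_u = ceil_log2(n * k)  # bits to address one entry of U_k
    index_bits_v = ceil_log2(d * k)  # bits to address one entry of V_k
    return value_bits * (s_u + s_v) + index_bits_u * s_u + index_bits_v * s_v


def dense_rank_r_bits(n, d, r, value_bits=32):
    """Hypothetical baseline: dense rank-r factors with no index overhead."""
    return value_bits * r * (n + d)


# Example sizes (arbitrary): a 4096 x 4096 layer, r = 4, c = 4.
sparse_bits = osd_layer_memory_bits(4096, 4096, r=4, c=4, s_u=1000, s_v=2000)
dense_bits = dense_rank_r_bits(4096, 4096, r=4)
```

Under this reading, every retained entry pays an index overhead on top of its 32-bit value, which is why a larger rank only fits the same budget if the factors are sparsified aggressively enough.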
Abstract
Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs that are fine-tuned for specific tasks. Even so, storing the fine-tuned versions of these models remains a significant challenge because of the wide range of tasks they address. Recent studies show that fine-tuning primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be exploited for storage efficiency. However, using only low-rank approximation or only sparsification may discard critical singular components that enhance model expressivity. We first observe that, given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, which selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained. We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing low-rank approximation or sparsification individually.
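The comparison between a small-rank approximation and a sparsified larger-rank one can be sketched in NumPy as follows. This is not the OSD algorithm: it uses plain magnitude-based sparsification as a stand-in for the importance-aware selection. The helper names, the even split of singular values between the two factors, the toy random update, and the choice to match only the number of stored values (ignoring index overhead) are all assumptions made for illustration.

```python
import numpy as np


def truncated_factors(delta_w, k):
    """Rank-k factors (U_k, V_k) such that U_k @ V_k.T approximates delta_w."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    root = np.sqrt(s[:k])  # assumption: split singular values evenly between factors
    return u[:, :k] * root, vt[:k, :].T * root


def keep_largest(mat, num_keep):
    """Zero out all but the num_keep largest-magnitude entries of mat."""
    out = np.zeros(mat.shape, dtype=mat.dtype)
    num_keep = min(int(num_keep), mat.size)
    if num_keep <= 0:
        return out
    # Indices (in flattened C order) of the num_keep largest |entries|.
    idx = np.argpartition(np.abs(mat).ravel(), -num_keep)[-num_keep:]
    out.flat[idx] = mat.flat[idx]
    return out


rng = np.random.default_rng(0)
# Toy stand-in for a fine-tuning update: a random rank-8 matrix.
delta_w = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))

r, c = 2, 4
u_r, v_r = truncated_factors(delta_w, r)      # standard rank-r approximation
u_k, v_k = truncated_factors(delta_w, r + c)  # relaxed rank k = r + c

# Sparsify the larger factors down to as many stored values as the rank-r ones.
u_s = keep_largest(u_k, u_r.size)
v_s = keep_largest(v_k, v_r.size)

err_dense = np.linalg.norm(delta_w - u_r @ v_r.T)
err_sparse = np.linalg.norm(delta_w - u_s @ v_s.T)
```

Comparing `err_dense` and `err_sparse` on a toy matrix measures reconstruction error only. The paper's claim concerns downstream task accuracy under an equal memory budget, which this sketch does not evaluate.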
