Rapid Switching and Multi-Adapter Fusion via Sparse High Rank Adapters

Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Rafael Esteves, Shreya Kadambi, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

TL;DR

Sparse High Rank Adapters (SHiRA) address the edge-deployment limitations of LoRA by directly finetuning only about $1\%$-$2\%$ of the base model weights. The resulting adapter is extremely sparse, which enables rapid on-device adapter switching and reduces cross-adapter interference. SHiRA is trained with gradient masking, using one of several mask families, to obtain high-rank sparse adapters without adding parameters to the forward pass. At inference, a scatter_op overwrites the affected weights instead of performing a full fusion. Across vision and language tasks (including LLaMA, LLaMA2, and Stable Diffusion), SHiRA outperforms LoRA in both single- and multi-adapter settings, with gains of up to $2.7\%$ in commonsense accuracy on LLMs and an average $6.69\%$ improvement in multi-adapter fusion on LLaMA2-7B. It also reduces peak GPU memory by about $16.63\%$ and enables up to $10\times$ faster weight overwrites on a CPU. The method is complementary to advanced LoRA variants such as DoRA, exhibits orthogonality in fusion behavior, and provides a practical path to edge-friendly, low-overhead PEFT. Together, these contributions advance efficient finetuning, rapid on-device switching, and reliable multi-concept fusion for large vision and language models.

Abstract

In this paper, we propose Sparse High Rank Adapters (SHiRA) that directly finetune 1-2% of the base model weights while leaving the others unchanged, thus resulting in a highly sparse adapter. This high sparsity incurs no inference overhead, enables rapid switching directly in the fused mode, and significantly reduces concept loss during multi-adapter fusion. Our extensive experiments on LVMs and LLMs demonstrate that finetuning merely 1-2% of the parameters in the base model is sufficient for many adapter tasks and significantly outperforms Low Rank Adaptation (LoRA). We also show that SHiRA is orthogonal to advanced LoRA methods such as DoRA and can be easily combined with existing techniques.
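The core idea of finetuning a small subset of weights while leaving the rest unchanged can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses plain Python instead of a deep learning framework, a four-weight linear model, and a hand-picked binary mask, all of which are assumptions made for illustration. The function name `train_masked` and the data are hypothetical. It shows only the mechanism: the gradient is multiplied by the mask, so frozen weights never move and the adapter can be stored as the few trained entries.

```python
import random

def train_masked(weights, mask, data, lr=0.1, steps=200):
    """SGD on a linear model y = w . x, updating only weights where mask == 1."""
    w = list(weights)
    for _ in range(steps):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            # Gradient of 0.5 * err**2 w.r.t. w_i is err * x_i. Multiplying by
            # the mask zeroes it for frozen weights, so those entries stay fixed.
            grads = [err * xi * mi for xi, mi in zip(x, mask)]
            w = [wi - lr * gi for wi, gi in zip(w, grads)]
    return w

random.seed(0)
base = [0.5, -1.0, 2.0, 0.0]    # hypothetical "pretrained" weights
mask = [0, 1, 0, 0]             # toy mask: only index 1 is trainable
target = [0.5, 3.0, 2.0, 0.0]   # the task differs from the base only at index 1

# Synthetic regression data generated from the target weights.
data = []
for _ in range(20):
    x = [random.uniform(-1, 1) for _ in base]
    data.append((x, sum(t * xi for t, xi in zip(target, x))))

tuned = train_masked(base, mask, data)

# The sparse adapter keeps only the trained entries: index -> new value.
adapter = {i: tuned[i] for i, m in enumerate(mask) if m}
print(adapter)
```

In a real model the mask would cover roughly 1-2% of each weight tensor and would come from one of the mask families the paper studies; here it is fixed by hand to keep the example small.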

Paper Structure (28 sections, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Sparse High Rank Adapters (SHiRA): Changing $\sim$1-2% of the weights of the pretrained model is often sufficient to achieve high performance. Due to its extreme sparsity, SHiRA enables rapid switching and reduces concept loss during multi-adapter fusion. In contrast, LoRA modifies the majority of parameters when fused, which prohibits rapid switching on mobile devices, and it suffers from concept loss and artifacts during multi-adapter fusion.
  • Figure 2: (a) LoRA appends two low-rank weights that can be fused into the base weights at inference. However, this modifies all weights and prevents rapid switching. (b) SHiRA finetunes very few pretrained weights by exploiting gradient masking during training. We show that finetuning as few as $1$-$2\%$ of the parameters is sufficient to achieve high accuracy on many adapter tasks.
  • Figure 3: (a) Rapid switching: SHiRA adapters can be stored as sparse weights together with their indices, which can then be loaded onto the base model. Since only $1$-$2\%$ of the weights need to be overwritten, the adapter can be switched efficiently at inference, eliminating the need for a separate fusion stage. (b) Multi-adapter fusion: Multiple adapters can be fused by naively adding them together and then loading the resulting sparse weights.
  • Figure 4: Comparison between different SHiRA masking methods for single- and multi-adapter image generation. For multi-adapter fusion, SHiRA-Struct outperforms all other adapters. SHiRA does not show the artifacts and concept loss seen with LoRA (see Koala/Knight).
  • Figure 5: Comparison between average times for the LoRA-fuse and SHiRA-scatter_op implementations for 10 randomly initialized weights of various dimensions on a CPU (e.g., dimension = $4096$ means that the weight has shape $4096\times 4096$). For fusing, we compute the time taken to merge LoRA adapters into the base weights (W + AB). Similarly, for the scatter_op, we report the time taken to overwrite base weights with SHiRA weights using the scatter op (torch.Tensor.scatter_) based implementation in PyTorch.
  • ...and 2 more figures
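
The rapid-switching and multi-adapter fusion ideas described in the Figure 3 and Figure 5 captions can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's code: weights are a flat Python list rather than a tensor, an adapter is a dict from index to overwritten value, and "naively adding" is interpreted as summing each adapter's change relative to the base weights. The helper names `apply_adapter` and `fuse_adapters` are hypothetical. In PyTorch the overwrite step would correspond to the scatter op mentioned in the Figure 5 caption.

```python
def apply_adapter(weights, adapter):
    """Overwrite weights in place at the adapter's indices; return the old values."""
    saved = {i: weights[i] for i in adapter}
    for i, v in adapter.items():
        weights[i] = v
    return saved

def fuse_adapters(base, adapters):
    """Naive fusion (assumed semantics): sum each adapter's change relative to base."""
    fused = {}
    for adapter in adapters:
        for i, v in adapter.items():
            fused[i] = fused.get(i, base[i]) + (v - base[i])
    return fused

base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical flattened base weights
adapter_a = {1: 2.5, 4: 4.0}            # sparse adapter: index -> new value
adapter_b = {0: 0.0, 4: 5.5}            # overlaps with adapter_a at index 4

# Rapid switching: load adapter A, then restore the base by re-applying the
# saved values. Only the touched entries are ever written.
weights = list(base)
saved = apply_adapter(weights, adapter_a)
with_a = list(weights)
apply_adapter(weights, saved)
restored = list(weights)

# Multi-adapter fusion: combine A and B, then load the result like any adapter.
fused = fuse_adapters(base, [adapter_a, adapter_b])
apply_adapter(weights, fused)
print(with_a, restored, weights)
```

Because switching touches only the stored indices, its cost scales with the adapter size rather than with the full weight matrix, which is the property the Figure 5 timing comparison measures against LoRA fusion (W + AB).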