Sparse High Rank Adapters

Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Shreya Kadambi, Rafael Esteves, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, Markus Nagel

TL;DR

This paper introduces Sparse High Rank Adapters (SHiRA), a parameter-efficient tuning paradigm that trains highly sparse, high-rank adapters by gradient masking to modify only about 1–2% of base model weights. SHiRA enables rapid on-device adapter switching in fused mode and reduces concept loss during multi-adapter fusion, addressing key edge deployment challenges of LoRA. The authors provide both theoretical analyses of rank-sparsity tradeoffs and adapter orthogonality, and extensive experiments across vision and language tasks showing SHiRA often outperforms LoRA while enabling faster CPU loading and lower memory usage. A practical PEFT-based implementation with scatter-based weight overwriting demonstrates substantial deployment advantages, making SHiRA attractive for real-world edge and on-device applications, including SDXL/DreamBooth personalization.

Abstract

Low Rank Adaptation (LoRA) has gained massive attention in recent generative AI research. One of the main advantages of LoRA is its ability to be fused with pretrained models, adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid switching in the unfused mode. LoRA also exhibits concept loss when multiple adapters are used concurrently. In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving the others unchanged. This results in a highly sparse adapter which can be switched directly in the fused mode. We further provide theoretical and empirical insights on how high sparsity in SHiRA can aid multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning only a small fraction of the parameters in the base model significantly outperforms LoRA while enabling both rapid switching and multi-adapter fusion. Finally, we provide a latency- and memory-efficient SHiRA implementation based on the Parameter-Efficient Finetuning (PEFT) library which trains at nearly the same speed as LoRA while consuming up to 16% lower peak GPU memory, thus making SHiRA easy to adopt for practical use cases. To demonstrate rapid switching benefits during inference, we show that loading SHiRA on a base model can be 5x-16x faster than LoRA fusion on a CPU.
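
The abstract describes training by tuning only 1-2% of the base weights while leaving the rest unchanged. A minimal sketch of that idea, assuming a uniformly random mask and a plain SGD update on a flattened weight list, is shown below. The helper names `make_random_mask` and `masked_sgd_step`, the random mask choice, and the toy gradient are illustrative assumptions, not the authors' PEFT implementation.

```python
import random


def make_random_mask(num_weights, fraction, seed=0):
    """Return a 0/1 mask with round(fraction * num_weights) trainable slots.

    Hypothetical helper: a uniformly random mask is assumed here purely
    for illustration; other mask choices would work the same way.
    """
    rng = random.Random(seed)
    k = max(1, round(fraction * num_weights))
    trainable = set(rng.sample(range(num_weights), k))
    return [1 if i in trainable else 0 for i in range(num_weights)]


def masked_sgd_step(weights, grads, mask, lr=0.1):
    """Plain SGD update in which the gradient is zeroed wherever mask == 0."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


# Toy "layer": 1000 flattened weights, 2% of them trainable.
base = [0.5] * 1000
mask = make_random_mask(len(base), fraction=0.02)
grads = [1.0] * 1000  # stand-in gradient; a real one would come from backprop
tuned = masked_sgd_step(base, grads, mask)

# Positions whose value actually moved during the update.
changed = [i for i, (b, t) in enumerate(zip(base, tuned)) if b != t]
print(len(changed), "of", len(base), "weights changed")
```

Because the masked positions receive a zero update, the difference `tuned - base` is sparse by construction, and no extra adapter weights are allocated during training.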

Paper Structure (55 sections, 5 theorems, 5 equations, 13 figures, 13 tables)

This paper contains 55 sections, 5 theorems, 5 equations, 13 figures, 13 tables.

Key Result

Lemma 4.1

The parameter complexity and learning complexity of SHiRA are equal to the number of non-zero elements in the adapter.
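
As a rough worked illustration of Lemma 4.1, the count can be written out for a single weight matrix. The notation ($W$, $M$, $\rho$, $m$, $n$, $r$) and the example sizes are assumptions chosen for illustration and do not come from the source. The LoRA line uses the standard parameter count of a rank-$r$ factorization.

```latex
% Assumed notation: W \in \mathbb{R}^{m \times n} is one weight matrix,
% M \in \{0,1\}^{m \times n} is the binary mask, \rho is the kept fraction.
\[
  \#\text{params}_{\text{SHiRA}} = \|M\|_0 \approx \rho\, m n
\]
% Worked example with m = n = 4096 and \rho = 0.01:
\[
  0.01 \times 4096 \times 4096 = 0.01 \times 16{,}777{,}216 \approx 167{,}772
\]
% For comparison, a rank-r LoRA update BA with
% B \in \mathbb{R}^{m \times r}, A \in \mathbb{R}^{r \times n}:
\[
  \#\text{params}_{\text{LoRA}} = r\,(m + n), \qquad
  r = 16:\; 16 \times (4096 + 4096) = 131{,}072
\]
```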

Figures (13)

  • Figure 1: Sparse High Rank Adapters (SHiRA): Changing about $1$-$2\%$ of the weights of the pretrained generative model is often sufficient to achieve high performance. Due to its extreme sparsity, SHiRA enables rapid switching and also reduces concept loss during multi-adapter fusion. In contrast, LoRA modifies the majority of parameters when fused, thus prohibiting rapid switching on mobile devices, and also experiences concept loss during multi-adapter fusion. For LoRA, the elephant in the single "paintings" adapter case has artifacts (extra/broken tusks); the bird and knight in the multi-adapter case lose the "paintings" concept and keep only the "blue fire" effects. SHiRA does not experience these issues.
  • Figure 2: (a) LoRA, when fused into the pretrained model, modifies all weights and prevents rapid adapter switching. (b) SHiRA does not require additional weights during training but finetunes very few pretrained weights. Our approach relies on a sparse mask for gradient masking during training. We show that finetuning as few as $1$-$2\%$ of the parameters is sufficient to achieve high accuracy.
  • Figure 3: (a) Rapid adapter switching: The sparse finetuned weights can be stored as weights and their indices. At inference time, these weights can be loaded onto the base model weights. Since only $1$-$2\%$ of the weights need to be overwritten, the adapter can be efficiently switched with different weights at inference, eliminating the need for a separate fusion stage. (b) Multi-adapter fusion: Concept loss can be reduced if multiple adapters do not significantly interfere with each other.
  • Figure 4: Comparison of average AWOM (left) and AWOR (right) for 50 randomly initialized adapters. We compare different adapters, namely Dense, Sparse LoRA, SHiRA-WM, and SHiRA-Struct.
  • Figure 5: Comparison between different SHiRA masking methods for single- and multi-adapter image generation. For multi-adapter fusion, SHiRA-Struct outperforms all other adapters by generating exceptional images with high frequency details and good concept fusion (e.g., see fox and flower).
  • ...and 8 more figures
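
The Figure 3 caption describes storing an adapter as finetuned weights plus their indices and overwriting the base weights at load time. A minimal sketch of that storage and switching scheme on flattened Python lists follows. The function names `extract_adapter` and `apply_adapter`, the toy values, and the save-and-restore approach to switching are illustrative assumptions rather than the paper's scatter-based implementation.

```python
def extract_adapter(base, tuned):
    """Keep only the positions where finetuning changed the weight."""
    indices = [i for i, (b, t) in enumerate(zip(base, tuned)) if b != t]
    values = [tuned[i] for i in indices]
    return indices, values


def apply_adapter(weights, indices, values):
    """Overwrite the listed positions in place; return the old values for undo."""
    saved = [weights[i] for i in indices]
    for i, v in zip(indices, values):
        weights[i] = v
    return saved


base = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
tuned_a = [0.1, 0.9, 0.3, 0.4, 0.5, 0.6]   # adapter A changes index 1
tuned_b = [0.1, 0.2, 0.3, 0.4, -0.7, 0.6]  # adapter B changes index 4

adapter_a = extract_adapter(base, tuned_a)
adapter_b = extract_adapter(base, tuned_b)

live = list(base)                           # weights currently "on device"
saved_a = apply_adapter(live, *adapter_a)   # switch to adapter A
after_a = list(live)
apply_adapter(live, adapter_a[0], saved_a)  # restore the base weights
apply_adapter(live, *adapter_b)             # switch to adapter B
after_b = list(live)
```

Each switch touches only the stored indices, so the cost scales with the number of non-zero adapter entries rather than with the full size of the weight matrix.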

Theorems & Definitions (10)

  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Lemma 4.4
  • Lemma 4.5
  • proof
  • proof
  • proof
  • proof
  • proof