NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Models with Searchable Adaptation

Renqi Chen, Haoyang Su, Shixiang Tang

TL;DR

NAS-LoRA tackles the challenge of adapting SAM to domain-specific tasks by inserting a lightweight NAS block between LoRA's encoder and decoder to dynamically inject task-relevant inductive biases. It introduces a stage-wise optimization strategy and a PEFT-compatible NAS variant (NAS-PC-LoRA) to maintain efficiency while improving high-level semantic learning. Across nine segmentation benchmarks, NAS-LoRA and NAS-PC-LoRA outperform existing PEFT methods, achieving higher accuracy with around 24% lower training cost and no increase in inference cost. The work demonstrates that neural architecture search can be effectively and practically integrated into parameter-efficient fine-tuning for visual foundation models.

Abstract

The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhance SAM's adaptation performance across diverse domains. Despite these advances, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant because the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Extensive experiments demonstrate that NAS-LoRA improves on existing PEFT methods while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
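To make the idea of a searchable block between LoRA's encoder and decoder concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the class name `SearchableLoRA`, the three candidate operations (identity, a small dense map, and zero), and the softmax mixing over architecture parameters `alpha` are all hypothetical choices. The candidates are deliberately linear so that the adapter can be folded into the frozen weight exactly; the search space the paper actually uses is not described in this section.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

class SearchableLoRA:
    """Toy LoRA adapter with a mixed-operation cell in the rank-r bottleneck.

    Assumption: every candidate operation is a linear r x r map, so the
    whole adapter stays linear and can be merged into the frozen weight.
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))         # frozen pre-trained weight
        self.A = rng.normal(size=(rank, d_in)) * 0.1    # LoRA "encoder" (down-projection)
        self.B = rng.normal(size=(d_out, rank)) * 0.1   # LoRA "decoder" (up-projection)
        # Hypothetical candidate operations acting on the rank-r features.
        self.ops = [
            np.eye(rank),                               # identity (plain LoRA)
            rng.normal(size=(rank, rank)) * 0.1,        # small dense map
            np.zeros((rank, rank)),                     # zero (path disabled)
        ]
        self.alpha = np.zeros(len(self.ops))            # architecture parameters

    def cell(self):
        # Softmax-weighted mixture of the candidate operations.
        weights = softmax(self.alpha)
        return sum(w * op for w, op in zip(weights, self.ops))

    def forward(self, x):
        # x: (batch, d_in). Frozen path plus the low-rank path through the cell.
        return x @ self.W.T + x @ self.A.T @ self.cell().T @ self.B.T

    def merged_weight(self):
        # Fold the adapter into a single matrix: no extra cost at inference.
        return self.W + self.B @ self.cell() @ self.A

layer = SearchableLoRA(d_in=8, d_out=6, rank=2)
x = np.random.default_rng(1).normal(size=(4, 8))
same = np.allclose(layer.forward(x), x @ layer.merged_weight().T)
```

In this sketch `same` compares the adapter's forward pass with a single matrix multiply against the merged weight, which is the property behind the "no increase in inference cost" claim when every operation in the cell is linear.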

Paper Structure

This paper contains 16 sections, 10 equations, 6 figures, 10 tables, 1 algorithm.

Figures (6)

  • Figure 1: NAS-LoRA adds a lightweight, trainable NAS cell between LoRA's encoder and decoder to inject optimized prior knowledge. This enhancement shows superiority across all segmentation tasks, with only a minimal increase in training cost and no additional inference overhead.
  • Figure 2: The proposed NAS-LoRA framework for fine-tuning SAM. The upper part illustrates the design of each SAM component: NAS-LoRA is applied to the self-attention layers of the image encoder, the prompt encoder is frozen to enable automated processing, and the mask decoder is fully fine-tuned without LoRA, as it is a lightweight module. The lower part depicts the end-to-end stage-wise optimization of NAS-LoRA, where Stage 1 and Stage 2 iteratively update model weights and architecture parameters using two independent optimizers. After optimization, the learned architecture parameters and weights are directly merged into the pre-trained model for downstream tasks, following the standard LoRA merging process.
  • Figure 3: Visual comparisons on sample images from the ISIC 2017 (first row), Leaf (second row), CAMO (third row), Road (fourth row), and Transparent Object (fifth row) datasets.
  • Figure 4: Heatmap visualization by Grad-CAM (Selvaraju et al., 2017) on Leaf. Compared to previous methods, NAS-LoRA captures more fine-grained details.
  • Figure C1: Convergence behavior of NAS-PC-LoRA on the Leaf dataset over three independent trials.
  • ...and 1 more figure
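The stage-wise optimization described in the Figure 2 caption (two stages that iteratively update model weights and architecture parameters with two independent optimizers) can be sketched on a toy problem. Everything below is an assumption made for illustration: the scalar regression task, the two candidate operations (`x` and `x**2`), the hypothetical `SGD` helper, the learning rates, and the one-step-per-stage schedule are not taken from the paper.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(a - np.max(a))
    return e / e.sum()

class SGD:
    """Minimal optimizer with its own learning rate (hypothetical helper)."""

    def __init__(self, lr):
        self.lr = lr

    def step(self, param, grad):
        return param - self.lr * grad

def train(rounds=500):
    x = np.linspace(-1.0, 1.0, 21)
    y = 3.0 * x**2                       # toy target; favours the second candidate
    ops = np.stack([x, x**2])            # candidate operation outputs, shape (2, N)
    w, alpha = 1.0, np.zeros(2)          # model weight and architecture parameters
    weight_opt, arch_opt = SGD(0.1), SGD(0.1)   # two independent optimizers

    def evaluate(w, alpha):
        mix = softmax(alpha) @ ops       # softmax-weighted mixture of candidates
        residual = w * mix - y
        return np.mean(residual**2), residual, mix

    initial_loss = evaluate(w, alpha)[0]
    for _ in range(rounds):
        # Stage 1: update the weight while the architecture is held fixed.
        _, residual, mix = evaluate(w, alpha)
        w = weight_opt.step(w, np.mean(2.0 * residual * mix))

        # Stage 2: update the architecture parameters while the weight is fixed.
        _, residual, _ = evaluate(w, alpha)
        p = softmax(alpha)
        grad_p = (2.0 * w * residual * ops).mean(axis=1)   # dL/dp_k
        grad_alpha = p * (grad_p - p @ grad_p)             # chain rule through softmax
        alpha = arch_opt.step(alpha, grad_alpha)

    final_loss = evaluate(w, alpha)[0]
    return initial_loss, final_loss, w, softmax(alpha)

initial_loss, final_loss, w, probs = train()
```

In this toy setting the mixture weight `probs[1]` on the `x**2` candidate grows during training because it is the operation that matches the target. Keeping a separate optimizer for each parameter group lets the two stages use different learning rates or update rules without interfering with each other.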