Mixed-precision Supernet Training from Vision Foundation Models using Low Rank Adapter

Yuiko Sakuma, Masakazu Yoshimura, Junji Otsuka, Atsushi Irie, Takeshi Ohashi

TL;DR

This work tackles the challenge of deploying large vision foundation models under hardware constraints by jointly optimizing a mixed-precision quantized supernet and its architecture. It introduces memory-efficient training via quantization-aware LoRA, including multi-path and multiplex LoRA designs, plus a progressive training strategy to stabilize ultra-low bit-width subnets. The method is validated on SAM-based segmentation tasks, achieving about a 95% reduction in BitOPs without accuracy loss and with notable memory savings relative to existing mixed-precision NAS baselines. The approach enables efficient, scalable deployment of VFMs across diverse hardware, balancing performance, memory, and computation.

Abstract

Compressing large, high-performing vision foundation models (VFMs) to arbitrary bit-wise operation (BitOPs) budgets allows them to be deployed on various hardware. We propose to fine-tune a VFM into a mixed-precision quantized supernet. Supernet-based neural architecture search (NAS) suits this purpose: it trains a supernet, after which subnets that fit arbitrary hardware budgets can be extracted. However, existing methods have difficulty optimizing the mixed-precision search space and incur large memory costs during training. To tackle these challenges, we first study effective search space design for fine-tuning a VFM by comparing different operators (such as resolution, feature size, width, depth, and bit-widths) in terms of performance and BitOPs reduction. Second, we propose memory-efficient supernet training using a low-rank adapter (LoRA) and a progressive training strategy. The proposed method is evaluated on the recently proposed VFM, Segment Anything Model, fine-tuned on segmentation tasks. The searched model yields about a 95% reduction in BitOPs without incurring performance degradation.
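
To make the notion of a BitOPs budget concrete, here is a minimal sketch. It assumes the common convention that a layer's BitOPs equal its multiply-accumulate (MAC) count times its weight bit-width times its activation bit-width. The abstract does not state the exact definition used, and the layer sizes, bit-widths, and 32-bit baseline below are hypothetical values chosen only for illustration.

```python
def layer_bitops(macs, weight_bits, act_bits):
    """BitOPs of one layer under the assumed MACs * w_bits * a_bits convention."""
    return macs * weight_bits * act_bits


def model_bitops(layers):
    """Sum BitOPs over (macs, weight_bits, act_bits) tuples, one per layer."""
    return sum(layer_bitops(m, w, a) for m, w, a in layers)


# Hypothetical three-layer model; MAC counts are illustrative, not from the paper.
macs_per_layer = [100, 200, 100]

# Assumed baseline: every layer uses 32-bit weights and activations.
baseline = [(m, 32, 32) for m in macs_per_layer]

# One hypothetical mixed-precision subnet: (weight bits, activation bits) per layer.
mixed = [(100, 8, 8), (200, 4, 4), (100, 8, 8)]

reduction = 1 - model_bitops(mixed) / model_bitops(baseline)
print(f"{reduction:.1%}")  # prints 96.1%
```

Under this convention, lowering bit-widths alone already shrinks BitOPs sharply, which is why bit-width is a natural search dimension alongside resolution, width, and depth.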

Paper Structure

This paper contains 32 sections, 4 equations, 4 figures, 4 tables, and 2 algorithms.

Figures (4)

  • Figure 1:
  • Figure 3: The proposed LoRA-based architecture. (a) Regular LoRA freezes the pre-trained weight $W_0$ and fine-tunes only the low-rank decomposition weights $A$ and $B$. (b) The selective method switches the decomposed weights $A_n$ and $B_n$ according to the layer's bit-width. (c) The multiplex method always trains the base weights $A_{\mathrm{base}}$ and $B_{\mathrm{base}}$; for low bit-width layers, it additionally trains the weights $A_n$ and $B_n$ bound to the layer's bit-width.
  • Figure 4: Comparison with state-of-the-art mixed-precision supernet methods for subnets with different BitOPs budgets
  • Figure 5: Ablation studies for the proposed LoRA-based architectures
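
The three adapter layouts described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the dimensions, the candidate bit-widths, the set `low_bits` that decides which layers count as "low bit-width", and all function names are hypothetical. The quantizer itself is omitted, so the bit-width here only selects which adapters contribute to the effective weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 6, 2   # hypothetical layer and adapter sizes
bit_choices = (2, 4, 8)       # hypothetical candidate bit-widths
low_bits = {2}                # hypothetical: bit-widths treated as "low"

W0 = rng.normal(size=(d_out, d_in))  # frozen pre-trained weight


def make_adapter():
    """Return a low-rank pair (A, B). LoRA normally initializes B to zero;
    B is random here only so the demo produces non-zero updates."""
    A = rng.normal(scale=0.1, size=(rank, d_in))
    B = rng.normal(scale=0.1, size=(d_out, rank))
    return A, B


def regular_weight(W0, A, B):
    """(a) Regular LoRA: one trainable pair on top of the frozen weight."""
    return W0 + B @ A


def selective_weight(W0, adapters, bits):
    """(b) Selective: pick the pair (A_n, B_n) tied to the layer's bit-width."""
    A_n, B_n = adapters[bits]
    return W0 + B_n @ A_n


def multiplex_weight(W0, base, adapters, bits, low_bits):
    """(c) Multiplex: the base pair is always active; low bit-width layers
    add the pair tied to their bit-width on top of it."""
    A_base, B_base = base
    W = W0 + B_base @ A_base
    if bits in low_bits:
        A_n, B_n = adapters[bits]
        W = W + B_n @ A_n
    return W


base = make_adapter()
adapters = {b: make_adapter() for b in bit_choices}

x = rng.normal(size=(1, d_in))
for bits in bit_choices:
    y_sel = x @ selective_weight(W0, adapters, bits).T
    y_mux = x @ multiplex_weight(W0, base, adapters, bits, low_bits).T
    print(bits, y_sel.shape, y_mux.shape)
```

In this sketch the selective layout keeps one independent pair per bit-width, while the multiplex layout shares the base pair across all bit-widths and reserves the extra pairs for the low bit-width cases.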