
Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

Zixuan Hu, Yongxian Wei, Li Shen, Chun Yuan, Dacheng Tao

TL;DR

This work tackles the lack of tuning-free few-shot adaptability in Visual Foundation Models by reusing diverse pre-tuned LoRAs without access to their original data. It introduces LoRA Inversion to synthesize surrogate data and a meta-learning objective to distill a lightweight meta-LoRA, enabling single-pass adaptation on new tasks in a tuning-free manner. A double-efficient mechanism accelerates both data inversion and meta-training by selectively keeping informative tokens, improving efficiency without sacrificing performance. Across in-domain and cross-domain benchmarks, LoRA Recycle demonstrates solid improvements over fine-tuning baselines and other tuning-free methods, highlighting its practical potential for rapid, private-data-free VFM deployment.
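The inversion idea can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a frozen linear softmax classifier in place of a LoRA-equipped VFM, uses plain cross-entropy as a stand-in for the data loss, and omits any naturalness prior. The names `invert` and `predict` are hypothetical. It only shows the core loop: start from Gaussian noise and optimize the input, not the model, until the frozen model assigns it the target label.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def invert(weights, target, steps=200, lr=0.5, seed=0):
    """Optimize an input so a frozen linear classifier predicts `target`.

    `weights` is a list of per-class weight rows; the model is never updated.
    """
    rng = random.Random(seed)
    dim = len(weights[0])
    # Assumption: the surrogate input starts as standard Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(steps):
        logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
        probs = softmax(logits)
        # d(cross-entropy)/d(logit_c) = p_c - 1[c == target]
        err = [p - (1.0 if c == target else 0.0) for c, p in enumerate(probs)]
        # Chain rule through the linear layer: grad_x = W^T err
        grad = [
            sum(err[c] * weights[c][d] for c in range(len(weights)))
            for d in range(dim)
        ]
        # Gradient step on the input only.
        x = [v - lr * g for v, g in zip(x, grad)]
    return x


def predict(weights, x):
    """Return the index of the highest-scoring class."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return max(range(len(logits)), key=logits.__getitem__)
```

In this toy setting, calling `invert(weights, target)` yields an input that the frozen classifier labels as `target`, which is the property surrogate data needs in order to stand in for the unavailable training set.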

Abstract

Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, making them ideal for data-limited and real-time applications. However, this adaptability has not yet been replicated in current Visual Foundation Models (VFMs), which require explicit fine-tuning with sufficient tuning data. Meanwhile, the pretraining-finetuning paradigm has led to a surge of task-specific modular components, such as Low-Rank Adaptation (LoRA). For the first time, we explore the potential of reusing diverse pre-tuned LoRAs, without accessing their original training data, to achieve tuning-free few-shot adaptation in VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective, using surrogate data generated inversely from the pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, can solve new few-shot tasks in a single forward pass, akin to the in-context learning of LLMs. Additionally, we incorporate a double-efficient mechanism tailored to our framework, significantly accelerating the meta-training process while maintaining or even improving performance. Extensive experiments on various few-shot classification benchmarks, in both in-domain and cross-domain scenarios, demonstrate the superiority of our framework.
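The abstract does not spell out how a single forward pass turns a few labeled examples into predictions. One common tuning-free choice, assumed here purely for illustration, is nearest-prototype classification: embed the support and query images once, average the support features per class, and assign each query to the closest class mean. The functions `class_prototypes` and `classify` are hypothetical names, and the feature vectors stand in for whatever the meta-LoRA-equipped VFM would output.

```python
from collections import defaultdict


def class_prototypes(support_features, support_labels):
    """Average the support feature vectors of each class."""
    sums = {}
    counts = defaultdict(int)
    for feat, label in zip(support_features, support_labels):
        if label not in sums:
            sums[label] = list(feat)
        else:
            sums[label] = [s + f for s, f in zip(sums[label], feat)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(query_feature, prototypes):
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(prototypes, key=lambda label: sq_dist(query_feature, prototypes[label]))


# Toy 2-way 2-shot task with hand-written 2-D "features".
support = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
labels = ["cat", "cat", "dog", "dog"]
protos = class_prototypes(support, labels)
prediction = classify([1.0, 1.0], protos)
```

No gradient step is taken at test time in this sketch: adaptation to the new task happens entirely through the support features, which is what makes the procedure tuning-free.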

Paper Structure

This paper contains 21 sections, 8 equations, 9 figures, 18 tables, 1 algorithm.

Figures (9)

  • Figure 1: Thanks to the modularity of LoRA, users can upload locally tuned LoRAs to public repositories without exposing original training data. LoRA Recycle distills a meta-LoRA from these LoRAs without needing their original training data. The VFM, once equipped with the meta-LoRA, is empowered to solve new few-shot tasks in a single forward pass without fine-tuning.
  • Figure 2: Pipeline of LoRA Recycle. (i) (Pink Path) We generate task-specific surrogate data from the pre-tuned LoRA via LoRA Inversion. The input data (marked with the fire icon in the left corner) is initialized as Gaussian noise and iteratively optimized by minimizing $\mathcal{L}_{\rm data}$ (\ref{eq:inversion}). The surrogate data is then used to construct a meta-training task with one support set and one query set. (ii) (Black Path) We meta-train the meta-LoRA (marked with the fire icon in the middle) on a wide range of pre-tuned LoRAs by minimizing the meta-learning objective $\mathcal{L}_{\rm meta}$ (\ref{eq:LoRAhubDistillation}), explicitly teaching it how to adapt without fine-tuning.
  • Figure 3: Double-Efficient Mechanism. (Left: Inversion Stage) During the inversion stage, token pruning is applied in the hidden layers by removing unimportant tokens based on self-attention weights, accelerating both forward and backward computations for data generation. (Right: Meta-Training Stage) To highlight the most informative areas in the generated image, we construct a mask by setting values of 1 at the positions of remaining tokens and 0 elsewhere. We multiply the mask with the generated image to create a masked image. We then exclusively use the unmasked tokens for the following meta-training stage. This selective use of sparse tokens significantly accelerates the meta-training process, while maintaining or even improving performance by reducing noise from the generated data.
  • Figure 4: Visualization of generated images with their 75% token-masked versions.
  • Figure 5: Visualization of generated images with (left) and without (right) the naturalness prior $\mathcal{R}_{\rm BN}$.
  • ...and 4 more figures
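The masking step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the image is a single-channel grid of numbers, tokens are non-overlapping square patches indexed in row-major order, and the per-token importance scores are given directly rather than derived from self-attention weights. The helper names `top_k_tokens`, `build_patch_mask`, and `apply_mask` are hypothetical.

```python
def top_k_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in ascending order."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:k])


def build_patch_mask(height, width, patch, kept_tokens):
    """Pixel-level mask: 1 inside kept patches, 0 elsewhere."""
    cols = width // patch  # number of patches per row
    kept = set(kept_tokens)
    return [
        [1 if (r // patch) * cols + (c // patch) in kept else 0 for c in range(width)]
        for r in range(height)
    ]


def apply_mask(image, mask):
    """Multiply the image by the mask element-wise."""
    return [
        [px * m for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# 4x4 image split into four 2x2 patches; keep 25% of tokens (75% masked).
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
scores = [0.1, 0.6, 0.2, 0.1]  # hypothetical importance score per token
kept = top_k_tokens(scores, keep_ratio=0.25)
mask = build_patch_mask(4, 4, patch=2, kept_tokens=kept)
masked_image = apply_mask(image, mask)
```

Here only the top-right patch survives. In the setting the caption describes, only the surviving tokens would be passed on to meta-training, so the cost of each step scales with the number of kept tokens rather than with the full image.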