LION: Implicit Vision Prompt Tuning

Haixin Wang, Jianlong Chang, Xiao Luo, Jinan Sun, Zhouchen Lin, Qi Tian

TL;DR

LION presents a parameter-efficient vision prompt tuning approach by inserting two equilibrium implicit layers at the input and head of a frozen backbone, generating task-specific prompts while pruning non-crucial parameters via the lottery ticket hypothesis. The method achieves competitive or improved accuracy with far fewer trainable parameters than baselines like VPT, across CNN and Transformer backbones, and shows strong generalization in long-tail and few-shot settings. Theoretical and empirical analyses demonstrate effective optimization through Deep Equilibrium layers and robust training, with practical benefits for transfer learning and cloud deployment. Overall, LION offers a memory-stable, highly efficient pathway to adapt pre-trained vision models to diverse downstream tasks.

Abstract

Despite their recent competitive performance across a range of vision tasks, vision Transformers still suffer from heavy computational costs. Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. However, the efficiency of existing models is still far from satisfactory due to the insertion of extensive prompt blocks and tricky prompt designs. In this paper, we propose an efficient vision model named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models with stable memory costs for various complex tasks. In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained main backbone, with the parameters in the backbone frozen. Moreover, we prune the parameters in these two layers according to the lottery ticket hypothesis. The performance obtained by LION is promising on a wide range of datasets. In particular, LION reduces up to 11.5% of training parameter numbers while obtaining higher performance compared with the state-of-the-art baseline VPT, especially under challenging scenes. Furthermore, we find that LION has good generalization performance, making it an easy way to boost transfer learning in the future.
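
The pruning step mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract only says that the two prompt layers are pruned according to the lottery ticket hypothesis; the sketch below assumes the standard recipe (keep the largest-magnitude trained weights, then rewind the survivors to their initial values). The function names, the `keep_ratio` parameter, and the random stand-in weights are hypothetical and are not taken from the paper.

```python
import numpy as np

def magnitude_mask(weights, keep_ratio):
    """Return a 0/1 mask that keeps the largest-magnitude entries of `weights`."""
    flat = np.abs(weights).ravel()
    k = max(1, int(round(keep_ratio * flat.size)))
    kth = flat.size - k
    # The k-th largest magnitude serves as the pruning threshold.
    threshold = np.partition(flat, kth)[kth]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

def lottery_ticket_rewind(w_init, w_trained, keep_ratio):
    """Prune by trained magnitude, then reset surviving weights to their initial values."""
    mask = magnitude_mask(w_trained, keep_ratio)
    return mask * w_init, mask

rng = np.random.default_rng(0)
w_init = rng.normal(size=(8, 8))                    # hypothetical prompt-layer weights at init
w_trained = w_init + 0.5 * rng.normal(size=(8, 8))  # stand-in for weights after training
w_ticket, mask = lottery_ticket_rewind(w_init, w_trained, keep_ratio=0.25)
print(int(mask.sum()), "of", mask.size, "weights kept")
```

In this sketch only the masked subnetwork would be trained further, which is one way the number of trainable prompt parameters could be reduced; the pruning schedule actually used by LION is not stated here.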

Paper Structure (18 sections, 1 theorem, 13 equations, 3 figures, 4 tables)

Key Result

Proposition 1

There exists a vision prompt $x_{pro} = A x$ for invertible $A$ and $y_{pro} = y$ that can minimize the population loss: $\min_{W} \mathcal{L}(\hat{v}, W) = 0$. However, the vision prompt $z_{pro} = B z$ may not be sufficient: there exists a $B$ such that the population loss is non-zero for any c

Figures (3)

  • Figure 1: Demonstration of the implicit vision prompt layer. The left part shows the traditional construction of the prompt block by stacking MLPs. The right is LION with the implicit equilibrium layer and robust training for the prompt block.
  • Figure 2: Structures of our LION. We add two implicit layers, which are only injected in front of the input and behind the output of the pre-trained backbone respectively, as the vision prompts to enrich the vision input and representation.
  • Figure 3: Sensitivity analysis on two hyper-parameters.
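
The structure described in Figures 1 and 2 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the equilibrium layer is assumed to have the common form $z^* = \tanh(W z^* + U x + b)$, it is solved by plain fixed-point iteration, the prompt is assumed to be added to the signal it enriches, and the backbone is a hypothetical one-layer stand-in. All names (`implicit_prompt`, `frozen_backbone`, `contractive`) are made up for this example.

```python
import numpy as np

def implicit_prompt(x, W, U, b, tol=1e-8, max_iter=200):
    """Solve z = tanh(W z + U x + b) by fixed-point iteration and return z*."""
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

def frozen_backbone(x, B):
    """Hypothetical stand-in for the frozen pre-trained backbone (never updated)."""
    return np.maximum(B @ x, 0.0)

def contractive(M, rho=0.5):
    """Rescale M to spectral norm rho < 1 so the fixed-point iteration converges."""
    return rho * M / np.linalg.norm(M, 2)

rng = np.random.default_rng(0)
d = 6
W_in, W_out = contractive(rng.normal(size=(d, d))), contractive(rng.normal(size=(d, d)))
U_in, U_out = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b_in, b_out = np.zeros(d), np.zeros(d)
B = rng.normal(size=(d, d))  # frozen backbone weights

x = rng.normal(size=d)
x_pro = x + implicit_prompt(x, W_in, U_in, b_in)      # prompt in front of the backbone (additive: assumption)
h = frozen_backbone(x_pro, B)
h_pro = h + implicit_prompt(h, W_out, U_out, b_out)   # prompt behind the backbone output
print(h_pro.shape)
```

Only the two implicit layers carry trainable parameters in this sketch. The stable memory cost attributed to deep equilibrium layers comes from differentiating through the fixed point implicitly rather than through every iteration, which is not shown here.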

Theorems & Definitions (2)

  • Proposition 1
  • proof