Neural Prompt Search

Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu

TL;DR

NOAH tackles the challenge of choosing an optimal prompt design for large vision transformers by framing prompt-module selection as a neural architecture search problem. It unifies Adapter, LoRA, and Visual Prompt Tuning within a one-shot NAS framework and uses evolutionary search to derive dataset-specific subnet architectures under a parameter budget. Across VTAB-1k, few-shot, and domain-generalization tasks, NOAH outperforms individual prompt modules and demonstrates complementary module usage and robust transferability. While incurring extra supernet training cost, NOAH provides a practical, data-driven approach to scalable, dataset-aware parameter-efficient tuning for vision models.

Abstract

The size of vision models has grown exponentially over the last few years, especially after the emergence of Vision Transformer. This has motivated the development of parameter-efficient tuning methods, such as learning adapter layers or visual prompt tokens, which allow a tiny portion of model parameters to be trained whereas the vast majority obtained from pre-training are frozen. However, designing a proper tuning method is non-trivial: one might need to try out a lengthy list of design choices, not to mention that each downstream dataset often requires custom designs. In this paper, we view the existing parameter-efficient tuning methods as "prompt modules" and propose Neural prOmpt seArcH (NOAH), a novel approach that learns, for large vision models, the optimal design of prompt modules through a neural architecture search algorithm, specifically for each downstream dataset. By conducting extensive experiments on over 20 vision datasets, we demonstrate that NOAH (i) is superior to individual prompt modules, (ii) has a good few-shot learning ability, and (iii) is domain-generalizable. The code and models are available at https://github.com/Davidzhangyuanhan/NOAH.

Paper Structure (36 sections, 4 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: (a) Our approach, neural prompt search, or NOAH for short, subsumes three representative parameter-efficient tuning methods (i.e., Adapter (Houlsby et al., 2019), LoRA (Hu et al., 2021) and VPT (Jia et al., 2022)) and learns the optimal design from data through neural architecture search. (b) The approach is motivated by the observation that none of the three individual methods dominates on the VTAB-1k benchmark. The color of each dataset's name indicates which method performs best on it. NOAH is the best approach overall.
  • Figure 2: Illustration of crossover and mutation in the evolutionary search method.
  • Figure 3: Group-wise average results on VTAB-1k. NOAH performs the best in the Natural and Structured groups, while its performance in the Specialized group is similar to that of LoRA; unlike LoRA, however, NOAH does not require a manual search over the architecture and hyper-parameters.
  • Figure 4: Results of few-shot learning on five fine-grained visual recognition datasets. NOAH beats the individual modules on average.
  • Figure 5: Average subnets (architectures) for the three groups in VTAB-1k. Adapter and LoRA tend to appear in deep layers, while VPT is found at nearly all depths. The demand for VPT (indicated by the embedding dimension) differs across groups. The co-existence of the three modules, especially in deep layers, is strong evidence of their complementarity, and such synergy is difficult to obtain by hand-engineering.
  • ...and 1 more figure
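
The evolutionary search with crossover and mutation (Figure 2), constrained by a parameter budget, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the per-layer choices in `CHOICES`, the toy `EMBED_DIM`, the parameter-counting formula, the population settings, and `toy_fitness`, which stands in for whatever score the search would actually rank subnets by (for example, validation accuracy of a subnet evaluated with supernet weights). It is not the authors' implementation.

```python
import random

# Hypothetical search space: per-layer sizes for each prompt module.
# Layer count and choices are illustrative, not the paper's values.
NUM_LAYERS = 4
CHOICES = {
    "adapter_dim": [0, 4, 8],
    "lora_rank": [0, 4, 8],
    "vpt_tokens": [0, 5, 10],
}
EMBED_DIM = 16  # toy width, used only to count parameters


def random_subnet(rng):
    """Sample one candidate: a list of per-layer module sizes."""
    return [{k: rng.choice(v) for k, v in CHOICES.items()} for _ in range(NUM_LAYERS)]


def param_count(subnet):
    """Rough trainable-parameter count (assumed formula, biases ignored)."""
    total = 0
    for layer in subnet:
        total += 2 * EMBED_DIM * layer["adapter_dim"]  # down + up projection
        total += 2 * EMBED_DIM * layer["lora_rank"]    # two low-rank factors
        total += EMBED_DIM * layer["vpt_tokens"]       # learnable prompt tokens
    return total


def crossover(a, b, rng):
    """Build a child by taking each layer from one of the two parents."""
    return [dict(rng.choice((la, lb))) for la, lb in zip(a, b)]


def mutate(subnet, rng, prob=0.3):
    """Resample each per-layer choice independently with probability `prob`."""
    child = [dict(layer) for layer in subnet]
    for layer in child:
        for key, options in CHOICES.items():
            if rng.random() < prob:
                layer[key] = rng.choice(options)
    return child


def evolutionary_search(fitness, budget, rng, population=20, generations=10, top_k=5):
    def sample_valid(make):
        # Rejection sampling: retry until the candidate fits the budget.
        while True:
            cand = make()
            if param_count(cand) <= budget:
                return cand

    pop = [sample_valid(lambda: random_subnet(rng)) for _ in range(population)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[:top_k]
        children = list(parents)  # keep the best candidates unchanged
        while len(children) < population:
            if rng.random() < 0.5:
                make = lambda: crossover(*rng.sample(parents, 2), rng)
            else:
                make = lambda: mutate(rng.choice(parents), rng)
            children.append(sample_valid(make))
        pop = children
    return max(pop, key=fitness)


def toy_fitness(subnet):
    """Stand-in score that favors larger Adapter/LoRA modules in deeper layers."""
    return sum(
        depth * (l["adapter_dim"] + l["lora_rank"]) + l["vpt_tokens"]
        for depth, l in enumerate(subnet)
    )


rng = random.Random(0)
best = evolutionary_search(toy_fitness, budget=1000, rng=rng)
print(param_count(best), best)
```

Rejecting over-budget candidates at sampling time keeps every member of the population valid, so the returned subnet always respects the budget. In a real setting, `toy_fitness` would be replaced by an evaluation on held-out data.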