Semantic Hierarchical Prompt Tuning for Parameter-Efficient Fine-Tuning

Haowei Zhu, Fangyuan Zhang, Rui Qin, Tianxiang Pan, Junhai Yong, Bin Wang

TL;DR

This work tackles the inefficiency of full fine-tuning by introducing SHIP, a parameter-efficient fine-tuning method that leverages semantic hierarchies within pre-trained Vision Transformers. SHIP constructs semantic levels from inter-layer affinities and assigns three prompt types (Semantic Independent Prompts, Semantic Shared Prompts, and Attribute Prompts). A Prompt Matching Loss encourages discriminative feature learning, and Decoupled Attention preserves the pre-trained attention structure. The approach improves over VPT and is competitive with other PEFT methods on VTAB-1k without a large increase in trainable parameters. Overall, SHIP offers a task-specific fine-tuning strategy that improves feature aggregation and discrimination while remaining efficient.

Abstract

As the scale of vision models continues to grow, Visual Prompt Tuning (VPT) has emerged as a parameter-efficient transfer learning technique, noted for its superior performance compared to full fine-tuning. However, indiscriminately applying prompts to every layer without considering their inherent correlations can cause significant disturbances, leading to suboptimal transferability. Additionally, VPT disrupts the original self-attention structure, affecting the aggregation of visual features, and lacks a mechanism for explicitly mining discriminative visual features, which are crucial for classification. To address these issues, we propose a Semantic Hierarchical Prompt (SHIP) fine-tuning strategy. We adaptively construct semantic hierarchies and use semantic-independent and semantic-shared prompts to learn hierarchical representations. We also integrate attribute prompts and a prompt matching loss to enhance feature discrimination, and employ decoupled attention for robustness and reduced inference costs. SHIP significantly improves performance, achieving a 4.9% gain in accuracy over VPT with a ViT-B/16 backbone on VTAB-1k tasks. Our code is available at https://github.com/haoweiz23/SHIP.

Paper Structure

This paper contains 11 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: (a) Examples of feature maps in consecutive transformer layers of ViT-B/16. Semantic patterns are similar between adjacent layers and gradually evolve in deeper layers. (b) The inter-layer affinity matrix, averaged across all training samples from three typical datasets. (c) The mean and variance of the affinity between the $i$-th and $(i+1)$-th layers across six typical datasets.
  • Figure 2: Overview of our proposed SHIP (Semantic HIerarchical Prompt) fine-tuning framework. SHIP computes inter-layer affinity using features derived from the pre-trained model. Based on this affinity, a semantic hierarchy is established through greedy search. SHIP then learns specific prompts for each semantic level, integrating them with prompt matching loss and decoupled attention to enhance model performance.
  • Figure 3: Loss comparison between VPT and SHIP.
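
The Figure 2 caption describes two steps: computing inter-layer affinity from pre-trained features, then building a semantic hierarchy by greedy search. The following is a minimal sketch of one way such a pipeline could look, not the authors' implementation. It assumes affinity is the mean cosine similarity between per-sample layer features and that the greedy search groups adjacent layers whose affinity stays above a threshold. The function names `inter_layer_affinity` and `greedy_hierarchy`, the `threshold` parameter, and the feature layout are all hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def inter_layer_affinity(layer_feats):
    """Mean cosine similarity between every pair of layers.

    layer_feats: array of shape (L, N, D) holding one D-dimensional feature
    per sample (N) and per layer (L). This layout is an assumption.
    Returns an (L, L) affinity matrix.
    """
    feats = np.asarray(layer_feats, dtype=np.float64)
    norms = np.linalg.norm(feats, axis=-1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)  # guard against zero vectors
    # Sum per-sample cosine similarities for each layer pair, then average.
    return np.einsum("ind,jnd->ij", unit, unit) / feats.shape[1]


def greedy_hierarchy(affinity, threshold):
    """Group adjacent layers into semantic levels (assumed greedy rule).

    Walk through the layers in order; a layer joins the current level when
    its affinity to the previous layer is at least `threshold`, otherwise
    it starts a new level.
    """
    levels = [[0]]
    for i in range(1, affinity.shape[0]):
        if affinity[i - 1, i] >= threshold:
            levels[-1].append(i)
        else:
            levels.append([i])
    return levels
```

Under these assumptions, each returned level is a list of layer indices that would share one set of level-specific prompts. The threshold controls how many levels emerge, so it would need tuning per backbone or dataset.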