Hierarchical Side-Tuning for Vision Transformers

Weifeng Lin, Ziheng Wu, Wentao Yang, Mingxin Huang, Jun Huang, Lianwen Jin

TL;DR

This work tackles the high resource cost of adapting large Vision Transformers (ViTs) to downstream tasks, especially dense prediction, by introducing Hierarchical Side-Tuning (HST). HST builds a lightweight Hierarchical Side Network (HSN) that processes multi-scale image features and leverages intermediate ViT activations via a Meta-Register and a Transformation Bridge to fuse information efficiently. Across classification and dense prediction benchmarks, HST achieves state-of-the-art results among PETL methods and even rivals full fine-tuning on several tasks, notably attaining an average VTAB-1K Top-1 accuracy of $76.1\%$ with only $0.78$M trainable parameters. The method reduces the gap between PETL and full fine-tuning in dense prediction, demonstrating strong practical potential for versatile, parameter-efficient transfer learning on large ViTs.

Abstract

Fine-tuning pre-trained Vision Transformers (ViTs) has shown significant promise for visual recognition tasks. Yet fine-tuning every parameter separately for each task incurs substantial computational and memory costs. Recent advances in Parameter-Efficient Transfer Learning (PETL) have shown potential for achieving high performance with far fewer parameter updates than full fine-tuning. However, their effectiveness has been observed primarily on simple tasks such as image classification, and they struggle on more complex vision tasks such as dense prediction. To address this gap, we seek a tuning method that serves a wider range of visual tasks. We introduce Hierarchical Side-Tuning (HST), a PETL method for transferring ViT models to diverse downstream tasks. Unlike existing methods that tune parameters only within specific input spaces or modules, HST employs a lightweight Hierarchical Side Network (HSN) that leverages intermediate activations from the ViT backbone to model multi-scale features, improving prediction. To evaluate HST, we conducted experiments across a range of visual tasks, including classification, object detection, instance segmentation, and semantic segmentation. HST achieved state-of-the-art performance on 13 of the 19 VTAB-1K tasks, with the highest average Top-1 accuracy of 76.1%, while fine-tuning only 0.78M parameters. On object detection and semantic segmentation on the COCO and ADE20K test-dev benchmarks, HST outperformed existing PETL methods and even surpassed full fine-tuning.
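
To put the 0.78M figure in perspective, the following back-of-the-envelope sketch compares the trainable budget against a frozen backbone and estimates storage across the 19 VTAB-1K tasks. The backbone size of 86M parameters is an assumption (roughly ViT-B/16 scale) and is not stated in the text above; the function and constant names are hypothetical, and task-specific heads are ignored.

```python
# Back-of-the-envelope parameter budget for side-tuning vs. full fine-tuning.
# ASSUMPTION: the frozen backbone has ~86M parameters (roughly ViT-B/16 scale);
# the summary above only states the 0.78M trainable figure and the 19 tasks.

HST_TRAINABLE_PARAMS = 780_000          # from the abstract (0.78M)
ASSUMED_BACKBONE_PARAMS = 86_000_000    # assumption, not from the source
NUM_TASKS = 19                          # VTAB-1K task count from the abstract


def trainable_fraction(trainable: int, frozen: int) -> float:
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen)


def storage_full_finetuning(backbone: int, tasks: int) -> int:
    """Full fine-tuning keeps one complete backbone copy per task."""
    return backbone * tasks


def storage_side_tuning(backbone: int, trainable: int, tasks: int) -> int:
    """Side-tuning shares one frozen backbone and stores only the small
    trainable part once per task."""
    return backbone + trainable * tasks


fraction = trainable_fraction(HST_TRAINABLE_PARAMS, ASSUMED_BACKBONE_PARAMS)
full = storage_full_finetuning(ASSUMED_BACKBONE_PARAMS, NUM_TASKS)
side = storage_side_tuning(ASSUMED_BACKBONE_PARAMS, HST_TRAINABLE_PARAMS, NUM_TASKS)

print(f"trainable share: {fraction:.2%}")        # trainable share: 0.90%
print(f"full fine-tuning: {full:,} parameters")  # 1,634,000,000
print(f"side-tuning:      {side:,} parameters")  # 100,820,000
print(f"storage ratio:    {full / side:.1f}x")   # 16.2x
```

Under these assumptions, fewer than 1% of the parameters are trained per task, and serving all 19 tasks needs roughly one sixteenth of the storage of 19 fully fine-tuned copies.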

Paper Structure (43 sections, 5 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Previous paradigm vs. our paradigm, including Adapter, Prompt Tuning, LoRA and our Hierarchical Side-Tuning.
  • Figure 2: Overall architecture of HST. The Blue Section represents the plain ViT, with its weights kept frozen. The Green Section is referred to as the Transformation Bridge (T-Bridge). The Pink Section is the proposed Hierarchical Side Network (HSN), composed of a convolutional stem followed by a sequence of $L$ Side blocks.
  • Figure 3: Left: Meta-Register and layer norm tuning. Right: Comparisons of cosine similarity between the output features of Meta-Register and input image tokens.
  • Figure 4: Transformation Bridge.
  • Figure 5: Side Block. (a) The schematic illustration of the proposed Side Block. (b) Illustration of linear complexity of cross-attention in Side block.
  • ...and 3 more figures
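
The data flow described in the Figure 2 caption (a frozen ViT whose intermediate activations feed a sequence of Side blocks) can be illustrated with a deliberately tiny, pure-Python toy. This is a sketch under strong assumptions rather than the authors' implementation: the class names are hypothetical, each "layer" is a fixed scalar multiplication, each side block is a single mixing weight, the convolutional stem is replaced by the identity, and the Transformation Bridge and Meta-Register are omitted because the text above does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass(frozen=True)
class FrozenBackbone:
    """Toy stand-in for the frozen ViT: one fixed scaling per layer."""
    layer_scales: Tuple[float, ...]

    def forward(self, x: Vector) -> List[Vector]:
        """Return the activation after every layer, not only the last one."""
        activations = []
        for scale in self.layer_scales:
            x = [scale * v for v in x]
            activations.append(x)
        return activations


@dataclass
class SideNetwork:
    """Toy side network: one trainable mixing weight per side block."""
    block_weights: List[float]

    def forward(self, stem_features: Vector, activations: List[Vector]) -> Vector:
        h = list(stem_features)
        # Side block i fuses backbone activation i into the running features.
        for weight, activation in zip(self.block_weights, activations):
            h = [hi + weight * ai for hi, ai in zip(h, activation)]
        return h


def side_tuned_forward(backbone: FrozenBackbone, side: SideNetwork, x: Vector) -> Vector:
    activations = backbone.forward(x)      # frozen path: nothing here is updated
    return side.forward(x, activations)    # identity "stem" for simplicity


backbone = FrozenBackbone(layer_scales=(2.0, 0.5))
side = SideNetwork(block_weights=[1.0, 1.0])
output = side_tuned_forward(backbone, side, [1.0, 2.0])

print(output)                    # [4.0, 8.0]
print(len(side.block_weights))   # 2 trainable values; the backbone has none
```

The point of the toy is the wiring: training would touch only `side.block_weights`, while the backbone is read-only and is queried for every intermediate activation rather than just its final output.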