Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation

Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You

TL;DR

DyT is a novel approach that improves both parameter and inference efficiency for ViT adaptation. A token dispatcher distinguishes informative tokens from less important ones and lets the latter dynamically skip the original block, reducing redundant computation during inference.

Abstract

Existing parameter-efficient fine-tuning (PEFT) methods have achieved significant success in adapting vision transformers (ViTs) by improving parameter efficiency. However, improving inference efficiency during adaptation remains underexplored. This limits the broader application of pre-trained ViT models, especially when the model is computationally expensive. In this paper, we propose Dynamic Tuning (DyT), a novel approach that improves both parameter and inference efficiency for ViT adaptation. Specifically, in addition to lightweight adapter modules, we propose a token dispatcher that distinguishes informative tokens from less important ones, allowing the latter to dynamically skip the original block and thereby reducing redundant computation during inference. Additionally, we explore multiple design variants to find the best practice for DyT. Finally, inspired by the mixture-of-experts (MoE) mechanism, we introduce an enhanced adapter to further boost adaptation performance. We validate DyT across various tasks, including image/video recognition and semantic segmentation. For instance, DyT achieves superior performance compared to existing PEFT methods while using only 71% of their FLOPs on the VTAB-1K benchmark.
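The token-skipping idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes the dispatcher scores each token with a linear layer followed by a sigmoid and activates tokens whose score is at least 0.5; the names `dispatch_mask`, `dynamic_block`, `toy_block`, and `toy_adapter` are hypothetical and chosen only for illustration. Only activated tokens are passed to the expensive block, while the lightweight adapter is applied to every token.

```python
import math

def sigmoid(z):
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dispatch_mask(tokens, weight, bias, threshold=0.5):
    """Return a 0/1 mask; 1 means the token is activated (assumed scoring rule)."""
    mask = []
    for tok in tokens:
        logit = sum(w * x for w, x in zip(weight, tok)) + bias
        mask.append(1 if sigmoid(logit) >= threshold else 0)
    return mask

def dynamic_block(tokens, block, adapter, weight, bias):
    """Run `block` only on activated tokens; run `adapter` on all tokens."""
    mask = dispatch_mask(tokens, weight, bias)
    # Gather the activated tokens so the block sees K <= N tokens.
    selected = [tok for tok, m in zip(tokens, mask) if m]
    block_out = iter(block(selected))
    adapted = adapter(tokens)
    out = []
    for tok, m, ad in zip(tokens, mask, adapted):
        # Deactivated tokens skip the block and receive a zero update from it.
        delta = next(block_out) if m else [0.0] * len(tok)
        out.append([x + d + a for x, d, a in zip(tok, delta, ad)])
    return out, mask

# Toy stand-ins for the frozen block and the trainable adapter (hypothetical).
def toy_block(xs):
    return [[2.0 * v for v in x] for x in xs]

def toy_adapter(xs):
    return [[0.5 * v for v in x] for x in xs]

tokens = [[1.0, 2.0], [-1.0, -2.0], [0.5, 0.5]]
weight, bias = [1.0, 1.0], 0.0

out, mask = dynamic_block(tokens, toy_block, toy_adapter, weight, bias)
print(mask)  # [1, 0, 1]
print(out)   # [[3.5, 7.0], [-1.5, -3.0], [1.75, 1.75]]
```

In this toy run the second token is deactivated, so `toy_block` processes two tokens instead of three. The saving in a real model would depend on how many tokens the dispatcher deactivates in each layer.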

Paper Structure (59 sections, 9 equations, 9 figures, 19 tables)


Figures (9)

  • Figure 1: FLOPs and Accuracy of ViT-B/16 [dosovitskiy2020image] on VTAB-1K [zhai2019large]. "Full tuning" denotes that all parameters are fine-tuned. AdaptFormer [chen2022adaptformer], LoRA [hu2021lora] and VPT [jia2022visual] are typical PEFT methods.
  • Figure 2: Overview of Dynamic Tuning. (a) In the fine-tuning stage, we adopt Gumbel Noise to enable end-to-end training. (b) In the inference stage, $\operatorname{TD}$ selects $K$ activated tokens $\mathbf{X}_s$ from $\mathbf{X}$ based on the mask $\mathbf{M}$, which saves the computation on the deactivated tokens in $\operatorname{Block}$. $\operatorname{Block}$ can represent an $\operatorname{Attn}$ block, an $\operatorname{MLP}$ block, or an entire transformer layer.
  • Figure 3: Model variants. For brevity, we omit the LayerNorm [ba2016layer] in $\operatorname{Attn}$ and $\operatorname{MLP}$ blocks. "DyT" denotes the dynamic tuning presented in Figure 2.
  • Figure 4: The architecture of the MoE-adapter. It consists of $N$ adapter experts.
  • Figure 5: Token activation rate in different layers. We visualize the token activation rates in ViT-B/16. "Overall" denotes the mean activation rate in the whole model, which is around 50% when $r$ is set to 0.5. "Layer0" and "Layer11" denote the lowest and highest levels, respectively. Notably, the activation rate in the last layer is exactly 0% on the CIFAR-100, SVHN, and K400 datasets.
  • ...and 4 more figures
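
The activation rates described in the Figure 5 caption can be computed from per-layer dispatcher masks. The sketch below is illustrative only: the function name `activation_rates` and the toy masks are hypothetical, and it assumes every layer sees the same number of tokens, so the overall rate is the plain mean of the per-layer rates.

```python
def activation_rates(masks_per_layer):
    """Return (per-layer rates, overall mean rate) from 0/1 token masks."""
    per_layer = [sum(mask) / len(mask) for mask in masks_per_layer]
    overall = sum(per_layer) / len(per_layer)
    return per_layer, overall

# Toy masks for a 4-layer model with 4 tokens per layer (made-up values).
masks = [
    [1, 1, 1, 1],  # Layer0: every token is activated
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],  # last layer: every token skips the block
]

per_layer, overall = activation_rates(masks)
print(per_layer)  # [1.0, 0.75, 0.25, 0.0]
print(overall)    # 0.5
```

An overall rate of 0.5 in this toy example mirrors the caption's setting of $r = 0.5$, while individual layers are free to deviate from that average.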