Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

Congqi Cao, Yueran Zhang, Yating Yu, Qinyi Lv, Lingtong Min, Yanning Zhang

TL;DR

This work tackles overfitting in few-shot action recognition by freezing a pre-trained image backbone and inserting Task-Adapters into the last several Vision Transformer layers. Each Task-Adapter reuses the frozen self-attention layer to perform task-specific self-attention (Task-MSA) across the videos of a given task, so that discriminative information is captured during feature extraction rather than only afterwards at the metric stage. The approach achieves state-of-the-art results on HMDB51, UCF101, Kinetics, and especially SSv2, and is robust to different pre-training regimes (ImageNet and CLIP) and metric modules. In practice, it offers a parameter-efficient way to leverage large pre-trained image models for few-shot video understanding, reducing overfitting while improving task-specific discriminability.

Abstract

Existing works in few-shot action recognition mostly fine-tune a pre-trained image model and design sophisticated temporal alignment modules at the feature level. However, fully fine-tuning the pre-trained model can cause overfitting due to the scarcity of video samples. Additionally, we argue that the exploration of task-specific information is insufficient when relying solely on well-extracted abstract features. In this work, we propose a simple but effective task-specific adaptation method (Task-Adapter) for few-shot action recognition. By introducing the proposed Task-Adapter into the last several layers of the backbone and keeping the parameters of the original pre-trained model frozen, we mitigate the overfitting problem caused by full fine-tuning and advance the task-specific mechanism into the process of feature extraction. In each Task-Adapter, we reuse the frozen self-attention layer to perform task-specific self-attention across different videos within the given task, capturing both distinctive information among classes and shared information within classes. This facilitates task-specific adaptation and enhances the subsequent metric measurement between the query feature and the support prototypes. Experimental results consistently demonstrate the effectiveness of our proposed Task-Adapter on four standard few-shot action recognition datasets. On the temporally challenging SSv2 dataset in particular, our method outperforms state-of-the-art methods by a large margin.
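
The task-specific self-attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, omits the trainable adapter layers, and assumes that each (frame, patch) location attends across the video axis of the task. The function name task_self_attention, the tensor layout (V, T, N, D), and the random matrices standing in for the frozen pre-trained projections are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def task_self_attention(x, w_q, w_k, w_v, w_o):
    """Single-head attention across the videos of one task (illustrative only).

    x: tokens of shape (V, T, N, D) = (videos in the task, frames, patches, channels).
    w_q, w_k, w_v, w_o: (D, D) projections, treated as frozen (never updated here).
    """
    V, T, N, D = x.shape
    # Assumption: the sequence axis for attention is the video axis, so every
    # (frame, patch) location becomes one batch item of length V.
    seq = x.transpose(1, 2, 0, 3).reshape(T * N, V, D)
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(D))  # (T*N, V, V)
    out = (attn @ v) @ w_o                                   # (T*N, V, D)
    # Restore the original (V, T, N, D) layout.
    return out.reshape(T, N, V, D).transpose(2, 0, 1, 3), attn

rng = np.random.default_rng(0)
V, T, N, D = 4, 2, 3, 8  # e.g. 2 support videos + 2 query videos
x = rng.normal(size=(V, T, N, D))
# Random stand-ins for the frozen pre-trained attention weights.
w_q, w_k, w_v, w_o = (rng.normal(size=(D, D)) * 0.1 for _ in range(4))

y, attn = task_self_attention(x, w_q, w_k, w_v, w_o)
print(y.shape, attn.shape)  # (4, 2, 3, 8) (6, 4, 4)
```

Because the projections are reused rather than learned, a layer of this kind adds no attention parameters of its own; in the method described above, the only tunable parameters would sit in the adapter layers, which this sketch leaves out.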

Paper Structure (15 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: AIM (a) adapts the standard ViT block (b) by freezing the original pre-trained model (outlined with a blue background) and adding tunable Adapters (c) individually for temporal adaptation, spatial adaptation and joint adaptation.
  • Figure 2: Illustration of our method. Note that we only add Adapters into the last $L$ ViT layers. In Task-Adapter, we introduce task adaptation after T-MSA and S-MSA to enhance task-specific information for few-shot action recognition. After feature extraction, the video features are passed to the metric module to compute the classification scores. The upper figure illustrates the computational process of the Task-Adapter given a 2-way 1-shot learning task with two query videos in the query set.
  • Figure 3: Comparison of the performance achieved by combining different fine-tuning strategies with existing widely used metric measurement methods for the 5-way 1-shot task on the challenging SSv2-Small dataset.
  • Figure 4: Effect of inserting Task-Adapters into the last $L$ ViT layers (e.g., $L$ = 1, 2, 3, 6, 12) on scene-related datasets.
  • Figure 5: Visualizations of the attention maps of "High Jump" and "Pole Vault" for a given few-shot learning task, obtained by the baseline AIM and our Task-Adapter. With the help of task-specific adaptation, our method pays more attention to the most discriminative regions of the actions.
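
The metric step mentioned in the Figure 2 caption (comparing query features with support prototypes to obtain classification scores) can be sketched for the same 2-way 1-shot task with two query videos. This is only a simplified stand-in: it assumes one pooled feature vector per video and uses cosine similarity to class-mean prototypes, whereas the metric modules evaluated in the paper may perform temporal alignment. The helper name prototype_scores and the toy feature values are hypothetical.

```python
import numpy as np

def prototype_scores(support, support_labels, query, n_way):
    """Cosine similarity between each query feature and each class prototype.

    support: (n_support, D) pooled video features; support_labels: (n_support,)
    query: (n_query, D). Returns scores of shape (n_query, n_way).
    """
    # A prototype is the mean of the support features belonging to one class.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in range(n_way)]
    )

    def l2_normalize(a):
        return a / np.linalg.norm(a, axis=-1, keepdims=True)

    return l2_normalize(query) @ l2_normalize(prototypes).T

# Toy 2-way 1-shot task with two query videos (made-up 2-D features).
support = np.array([[1.0, 0.0], [0.0, 1.0]])
support_labels = np.array([0, 1])
query = np.array([[0.9, 0.1], [0.2, 0.8]])

scores = prototype_scores(support, support_labels, query, n_way=2)
predictions = scores.argmax(axis=1)
print(predictions)  # [0 1]
```

Each query is assigned to the class whose prototype it is most similar to, so more discriminative features from the backbone translate directly into better-separated scores at this stage.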