A Comprehensive Review of Few-shot Action Recognition

Yuyang Wanyan, Xiaoshan Yang, Weiming Dong, Changsheng Xu

TL;DR

The paper addresses the challenge of recognizing human actions in videos with limited labeled examples by surveying FSAR methods. It presents a novel taxonomy that separates generative-based approaches from meta-learning, and within meta-learning it analyzes three core facets: video instance representation, category prototype learning, and generalized video alignment, all framed in episodic $N$-way $K$-shot evaluations. Benchmarks such as HMDB, UCF101, Kinetics, SSv2, and EPIC-Kitchens are surveyed, along with advances across skeleton-based, multimodal, unsupervised, cross-domain, incremental, and federated FSAR. The authors discuss future directions including larger, more realistic datasets, modality expansion, and leveraging large language or vision-language models to enhance generalization and applicability. Overall, the review provides a foundational resource and a roadmap for researchers and practitioners developing robust FSAR systems.

Abstract

Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data in action recognition. It requires accurately classifying human actions in videos using only a few labeled examples per class. Compared to few-shot learning in image scenarios, few-shot action recognition is more challenging due to the intrinsic complexity of video data. Numerous approaches have driven significant advancements in few-shot action recognition, which underscores the need for a comprehensive survey. Unlike early surveys that focus on few-shot image or text classification, we focus on the unique challenges of few-shot action recognition. In this survey, we provide a comprehensive review of recent methods and introduce a novel and systematic taxonomy of existing approaches, accompanied by a detailed analysis. We categorize the methods into generative-based and meta-learning frameworks, and further elaborate on the methods within the meta-learning framework, covering three aspects: video instance representation, category prototype learning, and generalized video alignment. Additionally, the survey presents the commonly used benchmarks and discusses relevant advanced topics and promising future directions. We hope this survey can serve as a valuable resource for researchers, offering essential guidance to newcomers and stimulating seasoned researchers with fresh insights.
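To make the episodic $N$-way $K$-shot setting and the idea of a category prototype more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a method from the survey: the names `sample_episode`, `class_prototypes`, and `nearest_prototype` are hypothetical, each video is assumed to have already been reduced to a fixed-length feature vector, and the prototype is taken to be the simple mean of the support features with squared Euclidean distance for matching.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_episode(
    videos_by_class: Dict[str, List[str]],
    n_way: int,
    k_shot: int,
    n_query: int,
    rng: random.Random,
) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Sample one N-way K-shot episode as (support, query) lists of (video_id, label)."""
    # Pick N classes, then K support and n_query query videos per class, without overlap.
    classes = rng.sample(sorted(videos_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(videos_by_class[cls], k_shot + n_query)
        support += [(vid, label) for vid in picked[:k_shot]]
        query += [(vid, label) for vid in picked[k_shot:]]
    return support, query


def class_prototypes(
    support_feats: Sequence[Tuple[Sequence[float], int]], n_way: int
) -> List[List[float]]:
    """Average the support feature vectors of each class (assumed prototype definition)."""
    dim = len(support_feats[0][0])
    sums = [[0.0] * dim for _ in range(n_way)]
    counts = [0] * n_way
    for feat, label in support_feats:
        counts[label] += 1
        for d, x in enumerate(feat):
            sums[label][d] += x
    return [[s / counts[c] for s in sums[c]] for c in range(n_way)]


def nearest_prototype(feat: Sequence[float], prototypes: List[List[float]]) -> int:
    """Return the label whose prototype is closest in squared Euclidean distance."""
    dists = [sum((x - p) ** 2 for x, p in zip(feat, proto)) for proto in prototypes]
    return dists.index(min(dists))
```

In a 5-way 1-shot episode, `sample_episode` would return 5 support videos (one per class) plus the query videos to classify; accuracy is then typically averaged over many such randomly sampled episodes.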

Paper Structure (36 sections, 7 figures, 4 tables, 1 algorithm)

This paper contains 36 sections, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: A taxonomically organized chronological timeline of few-shot action recognition methodologies, highlighting key developments and emerging trends. The plot illustrates the field's growth by showcasing the cumulative temporal distribution of publications from 2018 to 2025. The red underline signifies the method employs a language–image pretraining backbone (e.g., CLIP), while others adopt a unimodal pretraining backbone (e.g., ResNet).
  • Figure 2: Overview of the organization for the survey.
  • Figure 3: Comparison of the action recognition and few-shot action recognition tasks.
  • Figure 4: Performance of typical few-shot action recognition methods on the Kinetics dataset (carreira2017kinetics) in 5-way 1-shot and 5-way 5-shot settings.
  • Figure 5: The general framework of few-shot action recognition with meta-learning.
  • ...and 2 more figures