Understanding the Cross-Domain Capabilities of Video-Based Few-Shot Action Recognition Models

Georgia Markham, Mehala Balamurali, Andrew J. Hill

TL;DR

The paper tackles cross-domain few-shot action recognition in video by evaluating five representative FSAR methods across cross-domain settings generated from five action datasets. It quantifies domain shift using Maximum Mean Discrepancy (MMD) and systematically analyzes how base dataset design and domain difference affect downstream few-shot performance. Key findings show that simple transfer-learning approaches outperform others as domain shift grows; specialized cross-domain methods can underperform, especially with limited novel data, and temporal alignment techniques often fail to generalize to unseen domains. The work highlights the importance of base-data composition and provides guidance for dataset design and evaluation of CD-FSAR, offering a foundation for future method development in real-world, cross-domain video understanding.
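As a rough illustration of how a domain difference can be scored with MMD, the sketch below computes the biased squared-MMD estimate between two sets of feature vectors. It assumes an RBF kernel with a fixed bandwidth and precomputed per-video features; the function names, the kernel choice, and the bandwidth are illustrative assumptions, since this summary does not state which feature extractor or kernel the paper uses.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth):
    """RBF kernel matrix between the rows of x (n, d) and the rows of y (m, d)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y.

    The kernel and bandwidth are assumptions made for this sketch.
    """
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(0.0, 1.0, size=(200, 8))      # stand-in "base" features
    near = rng.normal(0.0, 1.0, size=(200, 8))      # same distribution
    far = rng.normal(3.0, 1.0, size=(200, 8))       # shifted distribution
    print("same domain:   ", mmd_squared(base, near, bandwidth=3.0))
    print("shifted domain:", mmd_squared(base, far, bandwidth=3.0))
```

A larger value indicates that the two feature distributions are further apart under the chosen kernel, which is the sense in which MMD can rank base/novel pairs by difficulty.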

Abstract

Few-shot action recognition (FSAR) aims to learn a model capable of identifying novel actions in videos using only a few examples. By assuming that the base dataset seen during meta-training and the novel dataset used for evaluation can come from different domains, cross-domain few-shot learning alleviates the data collection and annotation costs required by methods with greater supervision and by conventional (single-domain) few-shot methods. While this form of learning has been extensively studied for image classification, studies in cross-domain FSAR (CD-FSAR) are limited to proposing a model, rather than first understanding the cross-domain capabilities of existing models. To this end, we systematically evaluate existing state-of-the-art single-domain, transfer-based, and cross-domain FSAR methods on new cross-domain tasks of increasing difficulty, measured by the domain shift between the base and novel sets. Our empirical meta-analysis reveals a correlation between domain difference and downstream few-shot performance, and uncovers several important insights into which model aspects are effective for CD-FSAR and which need further development. Namely, we find that as the domain difference increases, the simple transfer-learning approach outperforms other methods by over 12 percentage points, and under these more challenging cross-domain settings, the specialised cross-domain model achieves the lowest performance. We also observe that state-of-the-art single-domain FSAR models that use temporal alignment achieve similar or worse performance than earlier methods that do not, suggesting that existing temporal alignment techniques fail to generalise to unseen domains. To the best of our knowledge, we are the first to systematically study the CD-FSAR problem in depth. We hope the insights and challenges revealed in our study inspire and inform future work in these directions.

Paper Structure (36 sections, 13 figures, 10 tables)

Figures (13)

  • Figure 1: An illustrative comparison between single-domain FSL, domain adaptation, and cross-domain FSL. This paper concerns cross-domain FSL for action recognition in video.
  • Figure 2: Domain differences for base dataset combinations used for evaluation in the results section.
  • Figure 3: 5-Way 5-Shot performance of existing single-domain (ProtoNet, STRM, MoLo), transfer-based (Transfer), and cross-domain (CDFSL-V) FSAR models trained using different base datasets (indicated by colour), and evaluated on different novel datasets (each subfigure). Each model was also tested untrained with random weights (pink bars). The raw values are found in the appendix of raw 5-shot results.
  • Figure 4: Measure of domain difference against downstream few-shot performance for each model.
  • Figure 5: Example frame sequences from videos within the HMDB51 dataset, with their corresponding action label.
  • ...and 8 more figures
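
For readers unfamiliar with the 5-Way 5-Shot setting named in Figure 3, the sketch below scores a single episode with the nearest-prototype rule that prototypical networks are built on: each class prototype is the mean of its support embeddings, and each query is assigned to the closest prototype. It assumes every video has already been reduced to one embedding vector; the synthetic data, the function names, and the squared Euclidean distance are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np


def prototypical_predict(support, support_labels, query, n_way):
    """Assign each query embedding to the class with the nearest prototype.

    support:        (n_way * k_shot, d) support embeddings
    support_labels: (n_way * k_shot,) integer labels in [0, n_way)
    query:          (n_query, d) query embeddings
    """
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in range(n_way)]
    )
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def episode_accuracy(support, support_labels, query, query_labels, n_way):
    """Fraction of queries classified correctly in one episode."""
    preds = prototypical_predict(support, support_labels, query, n_way)
    return float((preds == query_labels).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_way, k_shot, n_query, dim = 5, 5, 3, 16
    # Synthetic, well-separated class centres standing in for video embeddings.
    centres = 10.0 * rng.normal(size=(n_way, dim))
    support_labels = np.repeat(np.arange(n_way), k_shot)
    query_labels = np.repeat(np.arange(n_way), n_query)
    support = centres[support_labels] + rng.normal(scale=0.1, size=(n_way * k_shot, dim))
    query = centres[query_labels] + rng.normal(scale=0.1, size=(n_way * n_query, dim))
    print(episode_accuracy(support, support_labels, query, query_labels, n_way))
```

Few-shot benchmarks typically report the mean of this per-episode accuracy over many randomly sampled episodes from the novel dataset.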