Temporal Alignment-Free Video Matching for Few-shot Action Recognition

SuBeen Lee, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

TL;DR

This paper addresses few-shot action recognition (FSAR) by overcoming the rigidity and inefficiency of frame- and tuple-based temporal alignment. It introduces TEmporal Alignment-free Matching (TEAM), which encodes each video with a fixed set of learnable pattern tokens via cross-attention, enabling direct token-wise matching without predefined temporal units. The framework uses instance and exclusive pattern tokens, plus an episode-wise adaptation that removes class-shared information, improving discrimination for novel classes. Experiments on HMDB51, Kinetics, UCF101, and SSv2-Small show state-of-the-art performance and lower matching complexity than alignment-based methods. The approach also shows strong generalization, cross-domain robustness, and practical efficiency. Code is available on GitHub.

Abstract

Few-Shot Action Recognition (FSAR) aims to train a model with only a few labeled video instances. A key challenge in FSAR is handling divergent narrative trajectories for precise video matching. While frame- and tuple-level alignment approaches have been promising, they rely heavily on pre-defined and length-dependent alignment units (e.g., frames or tuples), which limits flexibility for actions of varying lengths and speeds. In this work, we introduce a novel TEmporal Alignment-free Matching (TEAM) approach, which eliminates the need for temporal units in action representation and for brute-force alignment during matching. Specifically, TEAM represents each video with a fixed set of pattern tokens that capture globally discriminative clues within the video instance regardless of action length or speed, which keeps the representation flexible. Furthermore, TEAM is inherently efficient: it measures similarity between videos with token-wise comparisons, unlike existing methods that rely on pairwise comparisons for temporal alignment. Additionally, we propose an adaptation process that identifies and removes information common across classes, establishing clear boundaries even between novel categories. Extensive experiments demonstrate the effectiveness of TEAM. Code is available at github.com/leesb7426/TEAM.
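
The pattern-token idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the plain dot-product cross-attention without learned projections, the random (rather than trained) pattern tokens, and the mean cosine similarity are all assumptions made for illustration. The point it shows is that videos with different frame counts map to the same fixed number of tokens, so matching reduces to comparing token m of one video with token m of the other.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_with_pattern_tokens(frame_feats, pattern_tokens):
    """Cross-attention with pattern tokens as queries and frames as keys/values.

    frame_feats:    (T, D) per-frame features, T may differ between videos
    pattern_tokens: (M, D) shared tokens (learnable in practice; assumption)
    returns:        (M, D) one aggregated feature per pattern token
    """
    d = frame_feats.shape[-1]
    attn = softmax(pattern_tokens @ frame_feats.T / np.sqrt(d), axis=-1)  # (M, T)
    return attn @ frame_feats

def token_wise_similarity(tokens_a, tokens_b):
    """Mean cosine similarity between corresponding tokens (hypothetical metric)."""
    a = tokens_a / np.linalg.norm(tokens_a, axis=-1, keepdims=True)
    b = tokens_b / np.linalg.norm(tokens_b, axis=-1, keepdims=True)
    return float((a * b).sum(axis=-1).mean())

rng = np.random.default_rng(0)
M, D = 4, 16
pattern_tokens = rng.normal(size=(M, D))  # random stand-in for trained tokens
support_video = rng.normal(size=(8, D))   # 8 frames
query_video = rng.normal(size=(20, D))    # 20 frames

support_tokens = aggregate_with_pattern_tokens(support_video, pattern_tokens)
query_tokens = aggregate_with_pattern_tokens(query_video, pattern_tokens)
score = token_wise_similarity(support_tokens, query_tokens)
print(support_tokens.shape, query_tokens.shape)
```

Both videos end up with an (M, D) representation even though one has 8 frames and the other 20, which is why no frame- or tuple-level alignment step is needed before computing the score.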

Paper Structure

This paper contains 26 sections, 18 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Comparison with alignment-based approaches on the "Diving Cliff" class. (a) Frame Alignment: For each frame of the query video, the most similar frame in the support video is identified, which makes precise frame-level alignment essential. (b) Tuple Alignment: The support and query videos are compared in sub-sequence units to account for variations in action speed. The set of varying tuples is pre-defined. (c) Temporal Alignment-free Matching: For both support and query videos, the video features are first aggregated using pattern tokens, which encode globally discriminative patterns. The corresponding aggregated features of the support and query videos are then compared directly. Pattern-based aggregation is both efficient and flexible, as it does not require an alignment process and is unaffected by differences in frame count or action speed. Note that the text description for each pattern token (green) is intended to provide an intuitive understanding of what each pattern token represents.
  • Figure 2: Illustration of our approach using a single pattern token ($M=1$) in the 3-way 3-shot scenario without a query. Instance (+) and exclusive (-) pattern tokens denote the video features aggregated by the pattern token. (Left) Randomly initialized pattern tokens are optimized with class labels in two complementary ways. First, instance pattern tokens (+) are encouraged to cluster with tokens of other instances from the same class while repelling those of other categories. Second, exclusive pattern tokens (-) learn to represent the otherness of each instance by positioning themselves in the embedding space of other classes. (Right) Although these two types of tokens are discriminative for base video categories, they may not fully capture the finer details needed for novel categories. To address this, we propose an adaptation process for the support pattern tokens of novel classes that refines the class decision boundaries. Note that when multiple pattern tokens are used, these processes run in parallel, with instance and exclusive tokens being compared only within the same pattern token.
  • Figure 3: Ablation study for the number of pattern tokens.
  • Figure 4: N-way 1-shot and 5-way K-shot results.
  • Figure 5: Visualization of cross-attention weights during instance pattern token aggregation. The same pattern token is visualized for all videos in each class. The green region in each frame indicates how much that frame contributes to the instance pattern token. (a) For 'Diving Cliff', the visualized token primarily responds to the moment water splashes as people dive. (b) In 'Playing Trumpet', the token focuses on the scenes where people hold a trumpet to their mouths.
  • ...and 2 more figures
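
The contrast drawn in Figure 1 between frame alignment (a) and alignment-free matching (c) can be made concrete by counting similarity computations. The sketch below is an illustration under stated assumptions, not either method as published: the function names are hypothetical, cosine similarity is assumed as the metric, frame alignment is reduced to "best support frame per query frame", and the token inputs are random stand-ins for aggregated pattern-token features.

```python
import numpy as np

def unit_rows(x):
    # Normalize each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def frame_alignment_score(query_frames, support_frames):
    """For each query frame, take its best-matching support frame (simplified).

    Needs the full (T_q, T_s) similarity matrix, i.e. T_q * T_s comparisons.
    """
    sim = unit_rows(query_frames) @ unit_rows(support_frames).T
    return float(sim.max(axis=1).mean()), sim.size

def token_matching_score(query_tokens, support_tokens):
    """Compare token m of the query only with token m of the support.

    Needs M comparisons, independent of the number of frames.
    """
    sims = (unit_rows(query_tokens) * unit_rows(support_tokens)).sum(axis=-1)
    return float(sims.mean()), sims.size

rng = np.random.default_rng(0)
T, M, D = 8, 4, 16
query_frames = rng.normal(size=(T, D))
support_frames = rng.normal(size=(T, D))
query_tokens = rng.normal(size=(M, D))    # stand-in for aggregated features
support_tokens = rng.normal(size=(M, D))  # stand-in for aggregated features

_, n_align = frame_alignment_score(query_frames, support_frames)
_, n_token = token_matching_score(query_tokens, support_tokens)
print(n_align, n_token)  # 64 pairwise comparisons vs 4 token-wise comparisons
```

With 8 frames per video the simplified frame alignment evaluates 8 x 8 = 64 frame pairs, while token-wise matching with M = 4 evaluates 4 pairs, and that number does not grow when the videos get longer.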