Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

Hongyu Qu, Ling Xing, Jiachao Zhang, Rui Yan, Yazhou Yao, Xiangbo Shu

TL;DR

This paper tackles few-shot action recognition (FSAR) by addressing a limitation of prior work: video representations are learned in isolation, without modeling relations across videos and across tasks. It introduces HR$^{2}$G-shot, a hierarchical framework that unifies inter-frame, inter-video, and inter-task relation modeling to learn task-specific temporal patterns from a holistic view. Its two main components are Inter-video Semantic Correlation (ISC), which enables fine-grained cross-video interactions within a task, and Inter-task Knowledge Transfer (IKT), which uses a temporal knowledge bank and temporal prototypes to transfer knowledge across tasks and improve generalization to unseen classes. Experiments on five standard FSAR datasets show that HR$^{2}$G-shot achieves state-of-the-art performance with a CLIP-ViT-B backbone, supporting the value of hierarchical relational reasoning and memory-based knowledge transfer for rapid adaptation in few-shot settings.

Abstract

Few-shot action recognition (FSAR) aims to recognize novel action categories from only a few exemplars. Existing methods typically learn frame-level representations for each video through inter-frame temporal modeling, or through inter-video interaction at coarse, video-level granularity. However, they treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos. As a result, they fail to capture fine-grained temporal patterns shared across videos and cannot reuse temporal knowledge from historical tasks. In light of this, we propose HR$^{2}$G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Beyond inter-frame temporal interactions, we devise two components that explore inter-video and inter-task relationships, respectively: i) Inter-video Semantic Correlation (ISC) performs fine-grained, frame-level interactions across videos, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from a bank that stores diverse temporal patterns from historical episode tasks. Extensive experiments on five benchmarks show that HR$^{2}$G-shot outperforms current leading FSAR methods.
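To make the two components concrete, the following is a minimal NumPy sketch of what a cross-video frame-level interaction and a bank retrieval step could look like. It is an illustration under stated assumptions, not the authors' implementation: the function names `inter_video_correlation` and `retrieve_from_bank` are hypothetical, plain scaled dot-product attention stands in for whatever learned modules the paper uses, there are no learned projections or masking, and how the bank is populated from historical tasks is not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inter_video_correlation(frames):
    """Hypothetical ISC-style step (assumption: plain dot-product attention).

    Every frame attends to the frames of all videos in the episode.
    frames: array of shape (num_videos, num_frames, dim).
    Returns an array of the same shape.
    """
    n, t, d = frames.shape
    tokens = frames.reshape(n * t, d)               # pool all frames of all videos
    attn = softmax(tokens @ tokens.T / np.sqrt(d))  # (n*t, n*t) frame-to-frame weights
    mixed = attn @ tokens                           # aggregate information across videos
    return (tokens + mixed).reshape(n, t, d)        # residual connection


def retrieve_from_bank(features, bank):
    """Hypothetical IKT-style step (assumption: soft retrieval by similarity).

    features: (num_videos, num_frames, dim); bank: (bank_size, dim).
    Each frame feature is augmented with a weighted sum of bank entries.
    """
    d = features.shape[-1]
    weights = softmax(features @ bank.T / np.sqrt(d))  # (n, t, bank_size)
    retrieved = weights @ bank                          # (n, t, dim)
    return features + retrieved


rng = np.random.default_rng(0)
episode = rng.normal(size=(6, 8, 16))  # toy episode: 6 videos, 8 frames, 16-dim features
bank = rng.normal(size=(10, 16))       # toy bank with 10 stored temporal patterns
out = retrieve_from_bank(inter_video_correlation(episode), bank)
print(out.shape)  # (6, 8, 16)
```

In this sketch the bank has 10 entries, which plays the role of the bank size $G$; both steps preserve the feature shape, so they can be chained after any inter-frame temporal module.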

Paper Structure

This paper contains 33 sections, 7 equations, 8 figures, 13 tables, 1 algorithm.

Figures (8)

  • Figure 1: Our main idea. (a): Previous FSAR works only rely on inter-frame relation modeling to learn video representations, ignoring the relations between videos and tasks. (b): In contrast, we unify three types of relation modeling (i.e., inter-frame, inter-video, inter-task) under one single framework, so as to capture task-specific temporal cues.
  • Figure 2: The overview of HR$^{2}$G-shot. (a) HR$^{2}$G-shot unifies three types of relation modeling (i.e., inter-frame, inter-video, and inter-task) to learn discriminative temporal features. (b) Inter-video Semantic Correlation (ISC) conducts fine-grained cross-video interactions to learn inter-video relationships. (c) To explore inter-task relationships, we retrieve and aggregate temporal knowledge from the bank, which maintains diverse temporal patterns from historical tasks.
  • Figure 3: Different masked interaction strategies for Inter-video Semantic Correlation.
  • Figure 4: The impact of temporal bank size $G$ in IKT on SSv2-small [goyal2017something] and HMDB51 [kuehne2011hmdb] under the 5-way 1-shot setting.
  • Figure 5: Similarity visualization between query samples ($q_{n}$) and support prototypes ($s_{n}$) with different methods in a meta-test episode from HMDB51 [kuehne2011hmdb]. A higher score indicates greater similarity. Green boxes indicate correct predictions and red boxes indicate incorrect predictions.
  • ...and 3 more figures