Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

Baoli Sun, Yihan Wang, Xinzhu Ma, Zhihui Wang, Kun Lu, Zhiyong Wang

TL;DR

This work addresses fine-grained action recognition by introducing Action-Region Tracking (ART), which explicitly discovers and tracks discriminative local regions across frames through a query-response mechanism. ART uses a Region-Specific Semantics Activation module with text-constrained queries to generate region responses, organizes them into action tracklets, and optimizes them with a Multi-level Tracklet Contrastive Loss, complemented by a task-specific fine-tuning of a text-derived semantics bank. The method achieves state-of-the-art performance on FineGym and Diving48 with competitive complexity, demonstrating improved discrimination of subtle motion patterns and enhanced interpretability via tracklet-based representations. The approach offers a practical framework for robust FGAR and potential generalization to conventional action recognition tasks in real-world video analysis scenarios.

Abstract

Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate the superiority of ART over previous state-of-the-art baselines.
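The query-response and tracklet ideas described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes plain scaled dot-product attention as the query-response step, and it uses random vectors in place of the text-constrained queries and backbone features. The function names `region_responses` and `build_tracklets` and all shapes are hypothetical, chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def region_responses(features, queries):
    """Query-response step (assumed here to be scaled dot-product attention).

    features: (T, N, D) array, N region features per frame for T frames.
    queries:  (K, D) array, one query per action region of interest.
    Returns:  (T, K, D) array, one response per query in every frame.
    """
    d = features.shape[-1]
    # scores[t, k, n]: similarity of query k to region n in frame t.
    scores = np.einsum("kd,tnd->tkn", queries, features) / np.sqrt(d)
    attn = softmax(scores, axis=-1)  # weights over regions sum to 1
    # Each response is an attention-weighted average of that frame's regions.
    return np.einsum("tkn,tnd->tkd", attn, features)


def build_tracklets(responses):
    """Link responses to the same query across frames: (T, K, D) -> (K, T, D)."""
    return responses.transpose(1, 0, 2)


# Hypothetical sizes: 8 frames, 49 regions (7x7 grid), 32-dim features, 4 queries.
rng = np.random.default_rng(0)
T, N, D, K = 8, 49, 32, 4
features = rng.normal(size=(T, N, D))  # stand-in for backbone features
queries = rng.normal(size=(K, D))      # stand-in for text-constrained queries

responses = region_responses(features, queries)
tracklets = build_tracklets(responses)
print(tracklets.shape)  # (4, 8, 32)
```

In this sketch a tracklet is simply the sequence of responses to one query over all frames, so `tracklets[k]` traces how the region matched by query `k` changes over time. The contrastive constraint and the semantics-bank fine-tuning described in the abstract are not modeled here.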

Paper Structure

This paper contains 27 sections, 16 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Examples of “salto backward tucked” and its variant with 1 twist reveal the fine-grained nature of action recognition, characterized by large intra-class variation and subtle inter-class differences. Accurate discrimination relies more on capturing temporal dynamics of local movements than on appearance or context.
  • Figure 2: (a) The visualization illustrates the motivation behind our Action-Region Tracking (ART) framework. ART aims to identify and track discriminative action regions across multiple local areas that evolve over time. Here, $\textbf{X}$ represents semantic features extracted from the backbone network, while $\textbf{Tr}$ denotes features processed by ART. (b) and (c) give the class activation maps and the response regions contributing to action prediction without and with ART, respectively. The backbone network tends to concentrate on easily distinguishable regions, often overlooking the dynamics of local details. In contrast, our ART framework focuses on discriminative action regions over time.
  • Figure 3: The overall framework of our proposed ART. The backbone extracts features from an input video, the Spatial Semantic Extraction (SSE) component enhances the features with spatial context, the Region-Specific Semantics Activation (RSSA) component captures region-specific semantic responses from the enhanced region-wise representations, and the Tracklet Generation (TG) component forms a group of action tracklets, i.e., a group of responses to the same position queries from all video frames along the temporal dimension. Finally, tracklet-based representations are integrated into a global representation through Tracklet Aggregation (TA), yielding the video's final recognition result. Furthermore, we transform action label descriptions into action phrases aligned with VLMs’ textual lexicon, forming a text-constrained semantics bank.
  • Figure 4: Illustration of the process of task-specific textual semantic bank fine-tuning. We implement a task-specific fine-tuning mechanism for updating $\textbf{S}$, ensuring that the textual semantics retain the semantic representations encoded by VLMs while being fine-tuned to align with task-specific preferences. The agent textual semantic bank $\textbf{S}^a$ is optimized by a video consistency loss and a prototype consistency loss to narrow the distance between categories across video and text modalities.
  • Figure 5: Impact of (a) the correlation degree of $\lambda$ and (b) the number of action tracklets on FineGym99, Diving48 and NTU60-XView in terms of Top-1 accuracy.
  • ...and 3 more figures