MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization

Zhenying Fang, Richang Hong

TL;DR

MGCA-Net addresses open-vocabulary temporal action localization by introducing a four-component architecture that separately localizes actions, predicts action presence, classifies base actions, and identifies novel actions via a coarse-to-fine classifier built on vision-language features. The approach enables multi-grained category awareness by combining snippet-level base-action classification with video-level identification of novel categories through MIL and cross-modal text-image similarity, followed by per-proposal fine-grained labeling. Extensive experiments on THUMOS'14 and ActivityNet-1.3 show state-of-the-art results for OV-TAL and Zero-Shot TAL, with notable improvements in novel-action localization and without requiring prompt tuning. The work highlights the value of decoupling base and novel category predictions and leveraging multiple templates to capture diverse textual representations, offering a practical path toward robust open-vocabulary video understanding.

Abstract

Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address this issue, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies the action categories present in the video at the video granularity, yielding coarse categories. It then assigns each action proposal to one of these coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier's awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
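
The coarse-to-fine step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the score threshold, the toy two-dimensional features, and the use of cosine similarity are all assumptions made for illustration. The sketch first keeps the categories whose video-level score passes a threshold (the coarse categories), then assigns a proposal to the coarse category whose text feature is most similar to the proposal feature.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_categories(video_scores, threshold=0.5):
    """Video granularity: keep categories whose video-level score passes the threshold.

    `video_scores` and `threshold` are hypothetical; the source does not give them.
    """
    return [name for name, score in video_scores.items() if score >= threshold]


def assign_fine(proposal_feature, text_features, coarse):
    """Proposal granularity: choose the coarse category with the most similar text feature."""
    if not coarse:
        return None
    return max(coarse, key=lambda name: cosine(proposal_feature, text_features[name]))


# Toy inputs (assumed values, not taken from the paper).
video_scores = {"diving": 0.9, "golf swing": 0.1, "cliff diving": 0.7}
text_features = {
    "diving": [1.0, 0.0],
    "golf swing": [0.0, 1.0],
    "cliff diving": [0.6, 0.8],
}

coarse = coarse_categories(video_scores)
label = assign_fine([0.9, 0.1], text_features, coarse)
print(coarse, label)
```

In this toy example, a proposal feature such as `[0.1, 0.9]` is closest to "golf swing" overall, but that category was filtered out at the video level, so the proposal is assigned to "cliff diving" instead. This shows how the coarse step narrows the candidate set before the fine step runs.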

Paper Structure

This paper contains 21 sections, 7 equations, 3 figures, 8 tables, 2 algorithms.

Figures (3)

  • Figure 1: The structural comparison of (a) the localization-then-classification method, (b) the text information injection method, and (c) our proposed method.
  • Figure 2: Overview of the proposed MGCA-Net. MGCA-Net employs frozen video, image, and text encoders to extract video, image, and text features. Based on video features, the localizer localizes category-agnostic action proposals. In parallel, the action presence predictor and conventional classifier predict the action presence score (APS) for each action proposal and its probabilities over base action categories. Based on the APS and base action probabilities, action proposals are categorized into base action instances, novel proposals, or discarded. The coarse-to-fine classifier determines the action categories of novel proposals. Specifically, the coarse-to-fine classifier first identifies all action categories in the video, yielding coarse categories. Subsequently, it extracts proposal features for each novel proposal and assigns an action category based on the similarity between the proposal features and the text features of the coarse categories, resulting in novel action instances. The final localization results are the union of base and novel action instances.
  • Figure 3: Visualization of the localization results produced by the baseline and our MGCA-Net on THUMOS'14, with the ground-truth labels shown for reference. The baseline is our method with the conventional classifier, action presence predictor, and coarse-to-fine classifier removed: it uses a single template for action categories and predicts action categories using the zero-shot capability of VLMs.
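
The Figure 2 caption states that proposals are sorted into base action instances, novel proposals, or discarded, based on the action presence score (APS) and the base action probabilities, and that the final result is the union of base and novel instances. The sketch below shows one way such routing could look. The thresholds, the decision rule, and every function name are assumptions for illustration; the source does not specify them.

```python
def route_proposal(aps, base_probs, aps_threshold=0.5, base_threshold=0.5):
    """Route one proposal to 'discard', 'base', or 'novel'.

    Assumed rule (not from the source): a low APS discards the proposal,
    a confident base-category probability makes it a base instance,
    and anything else is treated as a novel proposal.
    """
    if aps < aps_threshold:
        return ("discard", None)
    if base_probs:
        best = max(base_probs, key=base_probs.get)
        if base_probs[best] >= base_threshold:
            return ("base", best)
    return ("novel", None)


def merge_results(proposals, label_novel):
    """Return the union of base and novel action instances.

    `label_novel` is a hypothetical callable standing in for the
    coarse-to-fine classifier; it maps a segment to a category or None.
    """
    results = []
    for segment, aps, base_probs in proposals:
        kind, label = route_proposal(aps, base_probs)
        if kind == "novel":
            label = label_novel(segment)
        if kind != "discard" and label is not None:
            results.append((segment, label))
    return results


# Toy proposals: (start, end) in seconds, APS, base-category probabilities.
proposals = [
    ((0.0, 2.5), 0.9, {"running": 0.8, "jumping": 0.1}),
    ((3.0, 4.0), 0.2, {"running": 0.3, "jumping": 0.3}),
    ((5.0, 7.5), 0.8, {"running": 0.2, "jumping": 0.1}),
]

results = merge_results(proposals, label_novel=lambda segment: "diving")
print(results)
```

With these toy values, the first proposal becomes a base instance ("running"), the second is discarded for its low APS, and the third is passed to the stand-in novel classifier.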