MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization
Zhenying Fang, Richang Hong
TL;DR
MGCA-Net addresses open-vocabulary temporal action localization by introducing a four-component architecture that separately localizes actions, predicts action presence, classifies base actions, and identifies novel actions via a coarse-to-fine classifier built on vision-language features. The approach enables multi-grained category awareness by combining proposal-level base-action classification with video-level identification of novel categories through MIL and cross-modal text-image similarity, followed by per-proposal fine-grained labeling. Extensive experiments on THUMOS'14 and ActivityNet-1.3 show state-of-the-art results for OV-TAL and Zero-Shot TAL, with notable improvements in novel-action localization and without requiring prompt tuning. The work highlights the value of decoupling base and novel category predictions and leveraging multiple templates to capture diverse textual representations, offering a practical path toward robust open-vocabulary video understanding.
Abstract
Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address this issue, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer produces category-agnostic action proposals. For each proposal, the action presence predictor estimates the probability that it belongs to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over the base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies the action categories present in the video at the video granularity and then assigns each action proposal to one of these coarse categories at the proposal granularity. Combining the coarse-to-fine category awareness for novel actions with the conventional classifier's awareness of base actions yields multi-grained category awareness, which effectively enhances localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
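The coarse-to-fine step for novel categories can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names, the use of cosine similarity, the averaging over text templates, the top_k cutoff, and the mean pooling of proposal features to obtain a video-level feature (the paper's video-level step uses MIL, which is replaced here by plain pooling for brevity). The feature vectors stand in for embeddings from a vision-language model.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def category_score(feature: Vector, template_embeddings: Sequence[Vector]) -> float:
    """Mean similarity over all text templates of one category (assumed aggregation)."""
    total = sum(cosine(feature, t) for t in template_embeddings)
    return total / len(template_embeddings)


def coarse_to_fine(
    proposal_features: Sequence[Vector],
    text_embeddings: Dict[str, Sequence[Vector]],
    top_k: int = 2,
) -> List[str]:
    """Return one novel-category label per proposal (hypothetical helper)."""
    # Coarse step (video granularity): score every category against a pooled
    # video feature and keep the top_k candidates. Mean pooling is an
    # assumption standing in for the MIL-based video-level prediction.
    video_feature = mean_vector(proposal_features)
    ranked = sorted(
        text_embeddings,
        key=lambda name: category_score(video_feature, text_embeddings[name]),
        reverse=True,
    )
    coarse = ranked[:top_k]

    # Fine step (proposal granularity): each proposal is assigned to exactly
    # one of the coarse categories, never to a category filtered out above.
    return [
        max(coarse, key=lambda name: category_score(p, text_embeddings[name]))
        for p in proposal_features
    ]
```

Restricting the per-proposal decision to the categories that survive the video-level step is what makes the procedure coarse-to-fine: the label space each proposal must discriminate over shrinks before the fine-grained assignment happens.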
