Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Yehna Kim, Young-Eun Kim, Seong-Whan Lee
TL;DR
The paper tackles semantic ambiguity in zero-shot action recognition by supplementing action class labels with language-driven description attributes (DAs), which a large language model extracts from web-crawled descriptions. It introduces a Spatio-Temporal Interaction (STI) module that aligns these DA embeddings with video content at fine-grained spatial and temporal scales, using a CLIP-based backbone trained with symmetric cross-entropy objectives. Experiments in zero-shot, few-shot, and fully-supervised settings on UCF-101, HMDB-51, Kinetics-600, and Kinetics-400 show state-of-the-art or competitive performance, with gains attributed to the STI design and an attribute count of N_a = 8 performing best. Because the attributes are generated automatically, the approach avoids manual attribute annotation, and the results indicate that it transfers well across these settings.
Abstract
Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
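To make the alignment idea concrete, the following is a minimal sketch of a CLIP-style symmetric cross-entropy loss between matched video and text embeddings, plus zero-shot prediction by cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the plain-Python list representation are all hypothetical, and the summary does not specify how description attributes are fused into the text embedding or how the STI module produces the video embedding.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_cross_entropy(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy.

    video_embs[i] and text_embs[i] are assumed to be a matched pair.
    The temperature value is an assumption, not taken from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # logits[i][j]: scaled cosine similarity between video i and text j
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text: each row should put its mass on the diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)

def zero_shot_predict(video_emb, class_text_embs):
    """Return the index of the class text embedding most similar to the video."""
    v = l2_normalize(video_emb)
    sims = [
        sum(a * b for a, b in zip(v, l2_normalize(c))) for c in class_text_embs
    ]
    return max(range(len(sims)), key=sims.__getitem__)
```

In this sketch, each entry of `class_text_embs` would stand for the text embedding of an unseen class (for example, a class label enriched with its description attributes), so new classes can be scored without retraining.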
