Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

Yehna Kim, Young-Eun Kim, Seong-Whan Lee

TL;DR

The paper tackles semantic ambiguity in zero-shot action recognition by supplementing action class labels with language-driven description attributes (DAs), which a large language model extracts from web-crawled descriptions. It introduces a Spatio-Temporal Interaction (STI) module that aligns DA embeddings with video content at fine-grained spatial and temporal scales, using a CLIP-based backbone trained with symmetric cross-entropy objectives. Across zero-shot, few-shot, and fully supervised settings on UCF-101, HMDB-51, Kinetics-600, and Kinetics-400, the method reports state-of-the-art or competitive performance, with gains attributed to the STI design and the best results at an attribute count of N_a = 8. The approach reduces manual annotation costs, improves semantic grounding, and transfers across these settings.
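As a rough illustration of the symmetric cross-entropy objective mentioned above, the sketch below computes a CLIP-style loss over a batch of video and text embeddings. It assumes that the i-th video matches the i-th text in the batch and uses a temperature of 0.07; the function names, the batch layout, and the temperature are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_cross_entropy(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (B, D); row i of each is assumed to be a matching pair.
    # L2-normalize so that dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    idx = np.arange(len(logits))
    # Video-to-text: each row should peak on its own column.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: each column should peak on its own row.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Averaging the two directions keeps the objective symmetric: a video must pick out its text among the batch, and the text must pick out its video.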

Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
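The zero-shot setup the abstract starts from, associating a video embedding with class embeddings, can be sketched as a nearest-neighbor lookup under cosine similarity. This is a minimal sketch assuming the embeddings already come from a shared vision-language space; `zero_shot_predict` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def zero_shot_predict(video_emb, class_embs):
    # video_emb: (D,), class_embs: (C, D); both assumed to live in the same space.
    v = video_emb / np.linalg.norm(video_emb)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    scores = c @ v                      # cosine similarity to each class, shape (C,)
    return int(np.argmax(scores)), scores
```

When two class names share a multi-semantic word, their text embeddings can lie close together, so the argmax becomes sensitive to small differences in the scores. That is the ambiguity the description attributes are meant to reduce.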

Paper Structure

This paper contains 27 sections, 11 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Example of misclassified data due to the ambiguity of action classes. The model incorrectly infers "salsa spin" as "swing" or "tennis swing" because of the multi-semantic word "swing". This error illustrates the need for additional semantic information beyond action class labels.
  • Figure 2: Illustration of the difference between label-specific attributes (a), video-specific attributes (b) and ours (c). Our approach eliminates the manual annotation process and achieves zero-shot performance using only label-specific attributes.
  • Figure 3: An overview of our model framework. The architecture consists of spatial interaction and temporal interaction modules.
  • Figure 4: Details of spatio-temporal interaction. (a) Spatial interaction: Patch embeddings from each frame $(T \times N_p \times D)$ and attribute word embeddings $(N_w \times D)$ are projected to a common space. Patch-word similarities are computed, max-pooled across words (per patch), and then max-pooled across patches (per frame) to yield spatial features $f_{sp} \in \mathbb{R}^{T \times 1}$. (b) Temporal interaction: Word-frame similarities are softmax-normalized along time and averaged across words to produce temporal saliency $\boldsymbol{S}_{temp} \in \mathbb{R}^{T \times 1}$; these weights re-scale the video embedding to obtain the final spatio-temporal feature $(T \times D)$.
  • Figure 5: Qualitative comparison of attribute keywords. For two representative video samples, we show selected frames and attribute keywords produced by BIKE (baseline attributes) and by our description attributes (proposed attributes). Green text marks attributes that are closely aligned with the ground truth class, whereas red text marks attributes that are weakly related. Our method yields more action- and scene-relevant descriptors than the BIKE attributes.
  • ...and 1 more figure
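The two interactions described in the Figure 4 caption can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the inputs are taken to be already projected to the common space, similarities are cosine similarities, and the function names are hypothetical. The caption does not say how $f_{sp}$ is combined with the re-scaled video embedding, so the two outputs are returned separately.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2norm(x):
    # Normalize the last axis so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def spatial_interaction(patches, words):
    # patches: (T, N_p, D), words: (N_w, D); assumed already projected.
    sim = l2norm(patches) @ l2norm(words).T        # (T, N_p, N_w) patch-word similarities
    per_patch = sim.max(axis=2)                    # max over words -> (T, N_p)
    return per_patch.max(axis=1, keepdims=True)    # max over patches -> f_sp, (T, 1)

def temporal_interaction(frames, words):
    # frames: (T, D) frame-level video embedding, words: (N_w, D).
    sim = l2norm(words) @ l2norm(frames).T         # (N_w, T) word-frame similarities
    weights = softmax(sim, axis=1)                 # normalize each word's scores over time
    s_temp = weights.mean(axis=0)[:, None]         # average over words -> S_temp, (T, 1)
    return s_temp, s_temp * frames                 # saliency and re-scaled feature (T, D)
```

Because each word's scores are softmax-normalized over time before averaging, the entries of `s_temp` sum to 1, so it acts as a distribution over frames that highlights those most related to the attributes.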