OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition

Tongjia Chen, Hongshan Yu, Zhengeng Yang, Zechuan Li, Wei Sun, Chen Chen

TL;DR

This work targets the textual knowledge gap in open-vocabulary video recognition by transforming action category names into Spatio-Temporal Descriptors via an LLM and aligning video frames to these descriptors with an Optimal Descriptor Solver based on entropy-regularized Optimal Transport. The method decouples static and dynamic aspects of actions, reduces semantic overlap among category names, and adaptively matches frames to descriptors, yielding strong zero-shot performance (including $75.1\%$ on Kinetics-600) and robust results across few-shot and fully-supervised settings without altering the underlying model architecture. Core contributions include the Spatio-Temporal Descriptor framework, the OD Solver formulation, and comprehensive demonstrations of improved generalization, efficiency, and interpretability across six benchmarks. Overall, OST provides a practical, extensible path to open-vocabulary video understanding by leveraging external knowledge and principled frame-to-descriptor alignment.

Abstract

Due to the resource-intensive nature of training vision-language models on expansive video data, a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy between web-scale descriptive narratives and concise action category names, leading to a less distinct semantic space and potential performance limitations. In this work, we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the less distinct semantic space of category names, we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors, thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover, to assign the best descriptors to different video instances, we propose the Optimal Descriptor Solver, formulating video recognition as solving for the optimal matching flow between frame-level representations and descriptors. Comprehensive evaluations in zero-shot, few-shot, and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
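
The matching step described in the abstract can be illustrated with a small entropy-regularized Optimal Transport sketch solved by Sinkhorn iterations. This is not the authors' implementation: the cosine cost, the uniform marginals, the values of `eps` and `n_iters`, and the helper names `cosine_cost`, `sinkhorn`, and `ot_score` are all assumptions made for illustration, and the code uses only the Python standard library so it runs without extra dependencies.

```python
import math


def cosine_cost(frames, descriptors):
    """Cost matrix C[i][j] = 1 - cosine(frame_i, descriptor_j) (assumed cost)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    cost = []
    for f in frames:
        row = []
        for d in descriptors:
            dot = sum(a * b for a, b in zip(f, d))
            row.append(1.0 - dot / (norm(f) * norm(d)))
        cost.append(row)
    return cost


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Approximate the entropy-regularized OT plan with Sinkhorn scaling.

    Marginals are assumed uniform over frames (rows) and descriptors (columns).
    """
    n, m = len(cost), len(cost[0])
    mu = [1.0 / n] * n  # mass assigned to each frame
    nu = [1.0 / m] * m  # mass assigned to each descriptor
    # Gibbs kernel derived from the cost matrix
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals
        u = [mu[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_score(frames, descriptors, eps=0.1, n_iters=200):
    """Plan-weighted similarity between a video and one class's descriptors."""
    cost = cosine_cost(frames, descriptors)
    plan = sinkhorn(cost, eps, n_iters)
    return sum(
        plan[i][j] * (1.0 - cost[i][j])
        for i in range(len(plan))
        for j in range(len(plan[0]))
    )


if __name__ == "__main__":
    # Toy 2-D "embeddings": three frames, two descriptors
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    descriptors = [[1.0, 0.0], [0.0, 1.0]]
    for row in sinkhorn(cosine_cost(frames, descriptors)):
        print([round(p, 4) for p in row])
```

In this sketch a class would be scored by calling `ot_score` with that class's descriptor embeddings and taking the highest-scoring class. A smaller `eps` gives a sharper, closer-to-hard assignment at the cost of numerical stability.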

Paper Structure (29 sections, 20 equations, 12 figures, 7 tables)

Figures (12)

  • Figure 1: Motivation of our method. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the textual discrepancy between descriptive narratives and concise category names. This oversight results in a less separable latent space, which may hinder video recognition.
  • Figure 2: Sanity check on category names. We investigate the semantic distribution of video category names (Left) and quantify the semantic density of category names (Right). We observe a higher semantic similarity of category names on K400 and Sthv2 compared to ImageNet. Our proposed Spatio-Temporal Descriptor can greatly reduce the semantic similarity in latent space. Please refer to Sec. \ref{sec: LLMAug} for comprehensive details.
  • Figure 3: An overview of our pipeline for video recognition. We query the Large Language Model to augment category names into corresponding Category Descriptors. The descriptors disentangle category names into Spatio-Temporal Descriptors for static visual cues and temporal evolution, respectively. To fully refine the textual knowledge, we propose the Optimal Descriptor Solver, which adaptively aligns descriptors with video frames. An optimal matching flow is calculated by iteratively solving the entropy-regularized OT problem to assign optimal descriptors to each video instance. Please zoom in for comprehensive details.
  • Figure 4: Attention maps on the K600 validation set. We show Spatio Descriptors and Temporal Descriptors on the left and right, respectively. (Left): For videos that can be recognized from static frames, our OST attends more to the relevant object, while ViFi-CLIP \cite{vificlip} is often distracted by the background. (Right): For classes that require more temporal clues, ViFi-CLIP \cite{vificlip} attends more to appearance (e.g. soccer ball and soccer field), while our OST shows consistent attention to the temporally salient parts of the body, such as the player's feet.
  • Figure 5: Generalization on extreme outliers. We use the text-to-video diffusion model Show-1 \cite{zhang2023show} to generate synthetic videos whose semantic distribution is distinct from the fine-tuning data in Kinetics-400, to further demonstrate the generalizability of our method. Attention maps for Spatio Descriptors and Temporal Descriptors are visualized on the left and right, respectively.
  • ...and 7 more figures
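
The Figure 2 caption compares how semantically close category names are to one another in embedding space. One simple proxy for that idea is the mean pairwise cosine similarity over a set of text embeddings, sketched below. The exact metric used in the paper is not stated here, so this measure, the helper name `mean_pairwise_cosine`, and the toy vectors are assumptions for illustration only.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings.

    Higher values mean the embedded names sit closer together, i.e. they
    are harder to tell apart in the latent space.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for text-encoder outputs (hypothetical data)
    crowded = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.3]]
    spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(mean_pairwise_cosine(crowded) > mean_pairwise_cosine(spread))
```

With real data, the inputs would be text-encoder embeddings of the raw category names on one side and of the generated descriptors on the other. A lower value for the descriptors would be consistent with the reduced semantic similarity the caption reports.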