Text-Enhanced Zero-Shot Action Recognition: A training-free approach

Massimo Bosetti, Shibingfeng Zhang, Benedetta Liberatori, Giacomo Zara, Elisa Ricci, Paolo Rota

TL;DR

TEAR tackles zero-shot video action recognition by leveraging a language model to generate rich text descriptors for action classes and a frozen image-based VLM to perform inference with these descriptors, all without training on video data. By decomposing actions into sequential sub-actions and adding contextual descriptions, TEAR bridges the semantic gap between verb-heavy action labels and visual sequences. Empirical results on UCF101, HMDB51, and Kinetics-600 show TEAR achieving competitive or superior performance to training-based approaches while remaining inference-only. This text-driven, training-free approach reduces resource requirements and broadens open-vocabulary video understanding with potential extensions to untrimmed videos.

Abstract

Vision-language models (VLMs) have demonstrated remarkable performance across various visual tasks, leveraging joint learning of visual and textual representations. While these models excel in zero-shot image tasks, their application to zero-shot video action recognition (ZS-VAR) remains challenging due to the dynamic and temporal nature of actions. Existing methods for ZS-VAR typically require extensive training on specific datasets, which can be resource-intensive and may introduce domain biases. In this work, we propose Text-Enhanced Action Recognition (TEAR), a simple approach to ZS-VAR that is training-free and does not require the availability of training data or extensive computational resources. Drawing inspiration from recent findings in vision and language literature, we utilize action descriptors for decomposition and contextual information to enhance zero-shot action recognition. Through experiments on UCF101, HMDB51, and Kinetics-600 datasets, we showcase the effectiveness and applicability of our proposed approach in addressing the challenges of ZS-VAR.

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Overview of the proposed method. TEAR addresses the task of zero-shot action recognition. First, for every action class label $y$, we generate a set of action textual descriptors $\mathcal{D}(y)$ by querying an LLM. Then we compute the textual and visual embeddings, keeping both the image and text encoders frozen. Lastly, the final prediction is obtained by computing the similarity between the textual embeddings and the averaged visual embeddings.
  • Figure 2: Examples of descriptors matching visual cues in test videos. We show descriptors generated for four videos of Kinetics-600. We show four frames for each video and highlight the matching with the decomposition, description, and context. For each video, the label above represents the ground truth label.
  • Figure 3: Example of descriptors that do not match visual cues in test videos. We show descriptors generated for one video of Kinetics-600 of the class kissing. We show four frames from the video and highlight the matching with the decomposition, description, and context. For this sample, the textual descriptors do not match the visual cues in the video. Further qualitative analyses are available in the supplementary material.
  • Figure 4: Ablation on using the textual descriptors. We ablate the use of different textual descriptors defined in Sec. \ref{subsec:descriptors}. We report the Top-1 accuracy on the three datasets and use the same color coding as in Sec. \ref{subsec:descriptors}.
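
The inference step summarized in the Figure 1 caption (average the frame embeddings, then compare against the descriptor embeddings of each class) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy 3-dimensional vectors, and the choice to score a class by the mean cosine similarity over its descriptors are all assumptions made for the example. In practice the embeddings would come from the frozen image and text encoders of a VLM, and the descriptors from an LLM.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def classify_video(frame_embeddings, descriptor_embeddings):
    """Pick the class whose descriptors best match the averaged frame embedding.

    frame_embeddings: list of per-frame vectors (hypothetical image-encoder output).
    descriptor_embeddings: dict mapping class label -> list of descriptor vectors
        (hypothetical text-encoder output for the descriptor set D(y)).
    """
    # Averaged visual embedding over the sampled frames.
    video = mean_vector([l2_normalize(f) for f in frame_embeddings])
    scores = {}
    for label, descs in descriptor_embeddings.items():
        # Assumption: a class score is the mean similarity over its descriptors.
        sims = [cosine(video, d) for d in descs]
        scores[label] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores


# Toy example with made-up 3-d embeddings for two classes.
frames = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]]
descriptors = {
    "archery": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "kissing": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
label, scores = classify_video(frames, descriptors)
print(label)
```

Because nothing here is trained, swapping in a different label set only requires generating and embedding new descriptors, which is what makes this style of pipeline inference-only.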