Spatio-Temporal Context Prompting for Zero-Shot Action Detection

Wei-Jhe Huang, Min-Hung Chen, Shang-Hong Lai

TL;DR

This work tackles zero-shot spatio-temporal action detection by adapting CLIP to reason about per-person actions within their spatio-temporal context. It introduces ST-CLIP, a two-stage framework that couples Person-Context Interaction with a multi-layer Context Prompting module and an Interest Token Spotting mechanism to generate per-person, action-specific text representations without extra interaction heads. The approach achieves strong generalization on J-HMDB and competitive results on UCF101-24, while demonstrably handling multiple actions within AVA videos. By leveraging pretrained visual-language knowledge and designing targeted prompting and token-selection strategies, the method narrows the gap toward real-world open-set video understanding and provides a comprehensive zero-shot benchmark across three datasets.

Abstract

Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and they lack the generalization capability to recognize unseen action categories. In this paper, we aim to adapt pretrained image-language models to detect unseen actions. To this end, we propose a method that effectively leverages the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module uses contextual information to prompt labels, thereby generating more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism, which employs pretrained visual knowledge to find each person's interest context tokens; these tokens are then used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on the J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found at https://webber2933.github.io/ST-CLIP-project-page.
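
To make the idea of per-person interest tokens concrete, here is a minimal sketch. The Figure 2 caption says the attention weights in each encoder layer are used to find each person's interest tokens; the top-k selection rule, the function name `spot_interest_tokens`, the parameter `k`, and the toy attention values are all assumptions for illustration, not the authors' implementation.

```python
def spot_interest_tokens(attention, k):
    """Return the indices of the k context tokens with the largest weights.

    attention: one person's attention weights over the context tokens
               (hypothetical input; toy numbers are used below).
    k:         number of interest tokens to keep (assumed hyperparameter).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Rank token indices by attention weight, highest first.
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    # Keep the top k and restore the original token order.
    return sorted(ranked[:k])


# Two people in the same frame attend to different context tokens,
# so each one ends up with a different set of interest tokens.
person_a = [0.05, 0.40, 0.10, 0.30, 0.15]
person_b = [0.35, 0.05, 0.05, 0.10, 0.45]

tokens_a = spot_interest_tokens(person_a, k=2)
tokens_b = spot_interest_tokens(person_b, k=2)
```

Because the selection is done per person, the tokens passed on to prompting differ between individuals, which is what lets the text features be tailored to each person when several actions occur at the same timestamp.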

Paper Structure (31 sections, 4 equations, 7 figures, 15 tables)

This paper contains 31 sections, 4 equations, 7 figures, 15 tables.

Figures (7)

  • Figure 1: Overview of our method. We aim to transfer the knowledge of CLIP to detect unseen actions. We leverage the pretrained knowledge to model the interaction between people and their surrounding context. In addition, the Interest Token Spotting mechanism uses this knowledge to find the tokens most relevant to each person; Context Prompting then uses these visual tokens to augment the text contents, making them easier to distinguish.
  • Figure 2: ST-CLIP framework. We first extract the person tokens for the person bounding boxes detected in each frame. Then, we perform temporal modeling on the neighboring frames to obtain the context tokens. After that, we leverage CLIP’s visual knowledge to perform person-context interaction on these tokens. In addition, we use the attention weights in each encoder layer to find the interest tokens for each person; the Context Prompting layer then uses these visual tokens to prompt the class names. Finally, the cosine similarities between the person-context relational tokens and the label prompting features determine the classification scores for the actions.
  • Figure 3: Temporal modeling. We apply self-attention along the temporal dimension to fuse the information.
  • Figure 4: The impact of interest tokens.
  • Figure 5: More visualization of interest tokens. Bold text indicates unseen actions.
  • ...and 2 more figures
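
The final scoring step described in the Figure 2 caption (cosine similarity between a person-context relational token and each label prompting feature) can be sketched as follows. This is an illustrative sketch only: the function names, the three-dimensional toy vectors, and the class names are assumptions, and a real model would use CLIP-sized embeddings produced by the visual and text encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_actions(person_token, label_features):
    """Map each class name to its similarity with the person token."""
    return {
        name: cosine_similarity(person_token, feature)
        for name, feature in label_features.items()
    }


# Toy stand-ins (assumed values) for one person's relational token
# and the prompted text features of three candidate actions.
person_token = [1.0, 0.0, 1.0]
label_features = {
    "run": [1.0, 0.0, 1.0],
    "sit": [0.0, 1.0, 0.0],
    "wave": [1.0, 1.0, 0.0],
}

scores = score_actions(person_token, label_features)
best = max(scores, key=scores.get)  # class with the highest similarity
```

Since the scores depend only on similarity to text features, adding an unseen class at test time amounts to adding another entry to `label_features`, with no new classification head.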