Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

Wentao Bao, Kai Li, Yuxiao Chen, Deep Patel, Martin Renqiang Min, Yu Kong

TL;DR

This paper proposes OpenMixer, a method that exploits the inherent semantics and localizability of large vision-language models (VLMs) within the family of query-based detection transformers (DETR), and shows that OpenMixer outperforms baselines at detecting both seen and unseen actions.

Abstract

Action detection aims to detect (recognize and localize) human actions spatially and temporally in videos. Existing approaches focus on the closed-set setting, where an action detector is trained and tested on videos from a fixed set of action categories. However, this constrained setting is not viable in an open world, where test videos inevitably contain actions beyond the trained categories. In this paper, we address the practical yet challenging Open-Vocabulary Action Detection (OVAD) problem, which aims to detect any action in test videos while training a model on a fixed set of action categories. To achieve this open-vocabulary capability, we propose OpenMixer, a method that exploits the inherent semantics and localizability of large vision-language models (VLMs) within the family of query-based detection transformers (DETR). Specifically, OpenMixer is built from spatial and temporal OpenMixer blocks (S-OMB and T-OMB) and a dynamically fused alignment (DFA) module. Together, the three components combine the strong generalization of pre-trained VLMs with the end-to-end learning of the DETR design. Moreover, we establish OVAD benchmarks under various settings, and the experimental results show that OpenMixer outperforms the baselines at detecting both seen and unseen actions. We release the code, models, and dataset splits at https://github.com/Cogito2012/OpenMixer.
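
To make the open-vocabulary setting concrete, the following is a minimal sketch of how a class list can be extended at test time when classification is done by matching a visual feature against text embeddings of class names. It is not the OpenMixer or DFA implementation: the function names, the temperature value, and the hand-made three-dimensional "embeddings" are all hypothetical, and a real system would obtain the embeddings from a pre-trained VLM text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def open_vocab_scores(query_feature, class_embeddings, temperature=0.07):
    """Softmax over cosine similarities between one query feature and
    the text embedding of every class in the (possibly extended) vocabulary.

    The temperature value is an assumption chosen for illustration.
    """
    q = l2_normalize(query_feature)
    logits = {}
    for name, emb in class_embeddings.items():
        e = l2_normalize(emb)
        logits[name] = sum(a * b for a, b in zip(q, e)) / temperature
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits.values())
    exps = {name: math.exp(v - max_logit) for name, v in logits.items()}
    total = sum(exps.values())
    return {name: v / total for name, v in exps.items()}


# Hypothetical toy embeddings; real ones would come from a VLM text encoder.
seen_classes = {"run": [1.0, 0.0, 0.0], "jump": [0.0, 1.0, 0.0]}
# At test time the vocabulary grows without retraining: just add an embedding.
test_classes = {**seen_classes, "climb": [0.0, 0.0, 1.0]}

query = [0.1, 0.2, 0.9]  # hypothetical visual feature for one detected person
scores = open_vocab_scores(query, test_classes)
best = max(scores, key=scores.get)
```

Because the classifier is only a similarity lookup against text embeddings, adding the unseen class "climb" requires no new trainable weights, which is the property the open-vocabulary setting relies on.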

Paper Structure

This paper contains 18 sections, 1 equation, 8 figures, 14 tables.

Figures (8)

  • Figure 1: Framework (left) and the OpenMixer Block (right). Given a video and an open vocabulary of actions, we use prompted classes and a pre-trained video VLM to obtain all kinds of VLM features. With a stack of cascaded OpenMixer blocks and spatial-temporal queries, the model predicts the action scores, person boxes, and their associated person scores for the OVAD task.
  • Figure 2: Spatial and Temporal OMB, and DFA. In the S-OMB and T-OMB panels, the Q-Q and Q-V mixing modules mix information among queries and between queries and visual features, respectively. In the S-OMB, the dashed arrow is used only at the first stage. S-OMB, T-OMB, and DFA are each described in their own sections of the paper.
  • Figure 3: Hyperparameters. We show the video mAP with respect to different numbers of learnable queries and OMB stages.
  • Figure 4: Unseen Action Detection. We visualize our OpenMixer detections (in blue) and ground truth (in yellow) on two representative videos from novel classes. The numbers after class names are confidence scores. More visualizations are in the supplementary material.
  • Figure 5: Generated prompts for J-HMDB action categories. For each category, we generate one prompt sentence.
  • ...and 3 more figures