End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting

Yongqi Wang; Xinxiao Wu; Shuo Yang; Jiebo Luo

TL;DR

Open-vocabulary video visual relationship detection (Open-VidVRD) aims to detect visual relationships in videos for unseen object and relation categories. Existing methods rely on pre-trained, closed-set trajectory detectors, which limits how well they handle novel objects. The authors propose an end-to-end framework comprising a relationship-aware open-vocabulary trajectory detector and an open-vocabulary relationship classifier, both of which draw on CLIP’s knowledge: the detector through a distillation-based Transformer, the classifier through multi-modal prompting. A relationship query and an auxiliary relationship loss let the detector model object relationships explicitly, while spatio-temporal visual prompting and vision-guided language prompting improve the classifier’s open-vocabulary generalization. Experiments on VidVRD and VidOR show consistent gains, especially in novel-category settings, and a cross-dataset evaluation indicates that the framework also generalizes across datasets.

Abstract

Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based Transformer decoder, where the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and accordingly, an auxiliary relationship loss is designed to enable the decoder to perceive the relationships between objects explicitly. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP well to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for visual representation and vision-guided language prompting for language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. Our framework is also applied to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
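The open-vocabulary classification idea mentioned in the abstract can be illustrated with a minimal sketch: a visual embedding is scored against text embeddings of arbitrary category names by cosine similarity, so new categories can be added without retraining a fixed classifier head. This is not the authors' implementation. The function name `open_vocab_scores`, the temperature value, and the toy 3-dimensional vectors (which stand in for the outputs of CLIP's visual and text encoders) are assumptions made purely for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def open_vocab_scores(visual_emb, text_embs, temperature=0.01):
    """Softmax over cosine similarities between one visual embedding and the
    text embeddings of an arbitrary set of category names.

    The temperature is an assumed value, not one taken from the paper.
    """
    v = l2_normalize(visual_emb)
    names = list(text_embs)
    logits = [
        sum(a * b for a, b in zip(v, l2_normalize(text_embs[name]))) / temperature
        for name in names
    ]
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Hypothetical toy embeddings; real ones would come from CLIP's encoders.
toy_text_embs = {
    "ride": [0.9, 0.1, 0.0],
    "chase": [0.1, 0.9, 0.1],
    "sit on": [0.0, 0.2, 0.9],
}
toy_visual_emb = [0.8, 0.2, 0.1]

scores = open_vocab_scores(toy_visual_emb, toy_text_embs)
best = max(scores, key=scores.get)
```

Because the category set is just a dictionary of text embeddings, adding a novel relationship only requires adding another entry, which is the property that makes this style of classifier open-vocabulary.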

Paper Structure (45 sections, 27 equations, 9 figures, 9 tables)

Introduction
Related Work
Video Visual Relationship Detection
Open-vocabulary Visual Relationship Detection
Prompting CLIP
Our framework
Overview
Relationship-aware Open-vocabulary Trajectory Detection
Frame-wise Open-vocabulary Object Detection
Auxiliary Object Classification
Trajectory Association
Training Loss
Open-vocabulary Relationship Classification
Spatio-temporal Visual Prompting
Vision-guided Language Prompting
...and 30 more sections

Figures (9)

Figure 1: (a) Existing Open-VidVRD methods rely on trajectory detectors trained on closed datasets. (b) The proposed end-to-end model performs Open-VidVRD directly on the original videos.
Figure 2: (a) The proposed end-to-end framework, where the object trajectories and their categories are predicted by the relationship-aware open-vocabulary trajectory detector, and the relationship categories are predicted by the open-vocabulary relationship classifier. (b) The relationship-aware open-vocabulary trajectory detector. (c) The open-vocabulary relationship classifier.
Figure 3: Results of different token numbers of vision-guided language prompts on the VidVRD dataset. Colors denote the number of tokens: blue, orange, and green represent 8, 16, and 32 tokens, respectively. The horizontal axis represents the percentage of tokens from conditional prompts, from 0% (all tokens are from learnable continuous prompts) to 100% (all tokens are from learnable conditional prompts). (a) and (b) show the results of using different tokens on the mAP$_o$ metric in the auxiliary object classifier. (c) and (d) show the results of using different tokens on the mAP metric in the open-vocabulary relationship classifier.
Figure 4: Results of different values of the hyperparameters $\alpha$ and $\beta$ on the VidVRD dataset. The horizontal axis represents the value of the hyperparameter, and the vertical axis represents the mAP$_o$ performance. (a) shows the results of different values of $\alpha$ while keeping $\beta=0.5$. (b) shows the results of different values of $\beta$ while keeping $\alpha=0.3$.
Figure 5: Results of different values of loss function coefficients on the VidVRD dataset. The horizontal axis represents the values of the coefficients. The left and right vertical axes represent the results on the novel and all categories, respectively. (a) shows the results of different values of $\lambda_s$ while keeping $\lambda_4=1$ and $\lambda_5=1$. (b) shows the results of different values of $\lambda_4$ while keeping $\lambda_s=2$ and $\lambda_5=1$. (c) shows the results of different values of $\lambda_5$ while keeping $\lambda_s=2$ and $\lambda_4=2$. (d) shows the results of different values of $\gamma$ while keeping $\delta=0$. (e) shows the results of different values of $\delta$ while keeping $\gamma=0.2$.
...and 4 more figures
