OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment

Qi Liu, Weiying Xue, Yuxiao Wang, Zhenao Wei

TL;DR

OpenVidVRD tackles open-vocabulary VidVRD by transferring vision-language model (VLM) capabilities to video relation detection through region-caption prompts and a spatiotemporal refiner. The framework combines an open-vocabulary tracklet detector, a visual–text aggregation module, a spatiotemporal refiner module (STRM) that captures spatiotemporal cues, and a dynamic prompting scheme that fuses learnable and hand-crafted prompts for semantic alignment. It is optimized with a multi-component loss $L = \gamma L_{obj\&sub} + L_{rel} + \delta L_{int}$. Empirical results on VidVRD and VidOR show state-of-the-art performance, including large gains on unseen relations ($28.86\%$ mAP for all relations and $47.73\%$ for unseen) and substantial improvements in SGDet/Novel-split settings, indicating strong generalization and cross-modal alignment. The work highlights the practical potential of prompt-driven semantic space alignment for extending VLMs to dynamic video relations and open-vocabulary understanding in real-world scenarios.
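As a rough illustration of how the weighted objective above combines its three terms, the following minimal sketch computes $L$ from already-evaluated component losses. The function name `total_loss`, the argument names, and the default weights of 1.0 are assumptions made for illustration; this summary does not state the values of $\gamma$ and $\delta$ or how each component is computed.

```python
def total_loss(l_obj_sub: float, l_rel: float, l_int: float,
               gamma: float = 1.0, delta: float = 1.0) -> float:
    """Combine component losses as L = gamma * L_obj&sub + L_rel + delta * L_int.

    gamma and delta are weighting hyperparameters; the defaults here are
    placeholders, not values reported by the paper.
    """
    return gamma * l_obj_sub + l_rel + delta * l_int


# Example with made-up component values, for illustration only.
example = total_loss(l_obj_sub=2.0, l_rel=1.0, l_int=4.0, gamma=0.5, delta=0.25)
```

Note that only the first and third terms are scaled, so $\gamma$ and $\delta$ set their weight relative to the relation term $L_{rel}$, which always enters with weight 1.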

Abstract

The video visual relation detection (VidVRD) task is to identify objects and their relationships in videos, which is challenging due to the dynamic content, high annotation costs, and long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs' rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from region captions that are automatically generated for the video's regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy to align semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.

Paper Structure

This paper contains 20 sections, 15 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: (a) An illustration of the conventional VidVRD vs. Open-vocabulary VidVRD. Conventional VidVRD cannot predict any novel categories, including both predicates and objects. (b) shows the long-tailed distribution of the predicate classes in the VidOR dataset.
  • Figure 2: Overview of our framework. The proposed framework breaks down VidVRD into two main tasks: detecting object trajectories and classifying relations. Initially, trajectories for all objects are extracted from the video, followed by the generation of region captions from the caption model. Afterward, a visual-text aggregation module combines information from the two modalities. To capture both spatial and temporal contexts, we use a spatiotemporal refiner that takes in subject, object, union, background, and motion information in sequence. To examine how visual regions align with relationships, we combine both learnable and hand-crafted prompts to adapt a vision-language model for predicting new relations.
  • Figure 3: Overview of visual-text aggregation module and spatiotemporal refiner. The visual-text aggregation module combines visual and textual representations using self-attention and dot-cross attention. It further refines these features through feedforward networks and spatial pooling operations, allowing it to effectively capture the correlations between visual content and region captions. Meanwhile, the spatiotemporal refiner module employs a spatiotemporal transformer to examine dynamic state variations and update the visual representations.
  • Figure 4: Ablation study for prompt lengths. The results illustrate how different prompt lengths impact the outcomes, using blue for 4 tokens, red for 8 tokens, and green for 16 tokens. The horizontal axis represents the token percentage, from 0$\%$ (all tokens from learnable prompts) to 100$\%$ (all from hand-crafted prompts).
  • Figure 5: Qualitative examples of our model. The novel relations detected by the model are marked in yellow background.
  • ...and 1 more figure
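
To make the horizontal axis of Figure 4 concrete, the sketch below builds a fixed-length prompt in which a given fraction of the tokens comes from a hand-crafted template and the remainder are learnable slots. Everything here is a hypothetical illustration: the function `mix_prompt`, the `<learn_i>` placeholder strings, the example template, and the choice to place learnable tokens before hand-crafted ones are assumptions, and the paper's actual fusion mechanism is not described in this section.

```python
from typing import List, Sequence


def mix_prompt(handcrafted_tokens: Sequence[str], length: int,
               handcrafted_fraction: float) -> List[str]:
    """Build a prompt of `length` tokens from learnable and hand-crafted parts.

    handcrafted_fraction = 0.0 -> all tokens are learnable placeholders
    handcrafted_fraction = 1.0 -> all tokens come from the hand-crafted template
    """
    if not 0.0 <= handcrafted_fraction <= 1.0:
        raise ValueError("handcrafted_fraction must lie in [0, 1]")
    # Number of hand-crafted tokens, capped by what the template provides.
    n_hand = min(round(length * handcrafted_fraction), len(handcrafted_tokens))
    n_learn = length - n_hand
    # Placeholder strings stand in for trainable embedding vectors.
    learnable = [f"<learn_{i}>" for i in range(n_learn)]
    return learnable + list(handcrafted_tokens[:n_hand])


# Hypothetical 4-token template with an 8-token prompt, half hand-crafted.
template = ["a", "video", "of", "relation"]
half_and_half = mix_prompt(template, length=8, handcrafted_fraction=0.5)
```

In a real model the learnable slots would be trainable embedding vectors rather than strings; strings are used here only to keep the example dependency-free.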