OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment
Qi Liu, Weiying Xue, Yuxiao Wang, Zhenao Wei
TL;DR
OpenVidVRD tackles open-vocabulary VidVRD by transferring vision-language model capabilities to video relational tasks through region-caption prompts and a spatiotemporal refiner. The framework combines an open-vocabulary tracklet detector, a visual–text aggregation module, a spatiotemporal refiner module (STRM) that captures spatiotemporal cues, and a dynamic prompting scheme that fuses learnable and hand-crafted prompts for semantic alignment. It is optimized with a multi-component loss $L = \gamma L_{obj\&sub} + L_{rel} + \delta L_{int}$. Empirical results on VidVRD and VidOR show state-of-the-art performance, including large gains on unseen relations ($28.86\%$ mAP for all relations and $47.73\%$ for unseen) and substantial improvements in SGDet/Novel-split settings, demonstrating strong generalization and cross-modal alignment. The work highlights the practical potential of prompt-driven semantic space alignment for extending VLMs to dynamic video relations and supporting robust open-vocabulary understanding in real-world scenarios.
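As a minimal sketch of how the multi-component loss above combines its terms, the snippet below computes the weighted sum $L = \gamma L_{obj\&sub} + L_{rel} + \delta L_{int}$ from three scalar values. The function name, argument names, default weights, and sample numbers are illustrative assumptions; this summary does not state how each component is computed or which values of $\gamma$ and $\delta$ are used.

```python
def total_loss(l_obj_sub: float, l_rel: float, l_int: float,
               gamma: float = 1.0, delta: float = 1.0) -> float:
    """Weighted sum L = gamma * L_obj&sub + L_rel + delta * L_int.

    The relation term L_rel carries an implicit weight of 1; gamma and
    delta rescale the other two terms relative to it.
    """
    return gamma * l_obj_sub + l_rel + delta * l_int


# Hypothetical component values, chosen only to illustrate the arithmetic.
loss = total_loss(l_obj_sub=0.5, l_rel=1.0, l_int=0.25, gamma=2.0, delta=4.0)
print(loss)  # 2.0 * 0.5 + 1.0 + 4.0 * 0.25 = 3.0
```

Because only two weights appear in the formula, tuning $\gamma$ and $\delta$ amounts to setting the importance of the other two terms relative to the relation loss.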
Abstract
The video visual relation detection (VidVRD) task aims to identify objects and their relationships in videos, which is challenging due to dynamic content, high annotation costs, and the long-tailed distribution of relations. Visual language models (VLMs) help explore open-vocabulary visual relation detection tasks, yet often overlook the connections between various visual regions and their relations. Moreover, using VLMs to directly identify visual relations in videos poses significant challenges because of the large disparity between images and videos. Therefore, we propose a novel open-vocabulary VidVRD framework, termed OpenVidVRD, which transfers VLMs' rich knowledge and powerful capabilities to improve VidVRD tasks through prompt learning. Specifically, we use a VLM to extract text representations from region captions that are automatically generated for the video's regions. Next, we develop a spatiotemporal refiner module to derive object-level relationship representations in the video by integrating cross-modal spatiotemporal complementary information. Furthermore, a prompt-driven strategy for aligning semantic spaces is employed to harness the semantic understanding of VLMs, enhancing the overall generalization ability of OpenVidVRD. Extensive experiments conducted on the VidVRD and VidOR public datasets show that the proposed model outperforms existing methods.
