PRVR: Partially Relevant Video Retrieval
Xianke Chen, Daizong Liu, Xun Yang, Xirong Li, Jianfeng Dong, Meng Wang, Xun Wang
TL;DR
This work introduces Partially Relevant Video Retrieval (PRVR), a realistic setting in which an untrimmed video counts as relevant to a textual query if it contains a moment matching that query, even when the rest of its content is unrelated. It frames PRVR as a multiple instance learning problem and proposes MS-SL++, a dual-branch network that learns clip-scale and frame-scale similarities in a coarse-to-fine manner, augmented by length-aware clip clustering and Key Clip Guided Attention. Training combines triplet ranking and InfoNCE losses at each scale; inference fuses the clip-scale and frame-scale scores through $S(v,q)=\alpha S_c(v,q)+(1-\alpha) S_f(v,q)$. Experiments on TVR, ActivityNet-Captions, and Charades-STA demonstrate strong gains over T2VR baselines and show PRVR's potential as a first stage for video corpus moment retrieval (VCMR) and as a supporting step for single video moment retrieval (SVMR), with practical benefits in accuracy and efficiency. The work advances realistic video-text retrieval by explicitly modeling partial relevance and multi-scale semantic alignment.
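The inference-time fusion $S(v,q)=\alpha S_c(v,q)+(1-\alpha) S_f(v,q)$ can be illustrated with a minimal sketch. The function names, the default value of `alpha`, and the example scores are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def fuse_scores(clip_score: float, frame_score: float, alpha: float = 0.5) -> float:
    """Weighted fusion S = alpha * S_c + (1 - alpha) * S_f.

    alpha = 0.5 is an illustrative default, not a value reported by the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * clip_score + (1.0 - alpha) * frame_score


def rank_videos(scores: dict, alpha: float = 0.5) -> list:
    """Rank videos for one query by fused score, highest first.

    `scores` maps a (hypothetical) video id to a pair (S_c, S_f).
    """
    fused = {vid: fuse_scores(s_c, s_f, alpha) for vid, (s_c, s_f) in scores.items()}
    return sorted(fused, key=fused.get, reverse=True)


if __name__ == "__main__":
    # Made-up scores for two videos and a single query.
    example = {"video_a": (0.9, 0.1), "video_b": (0.2, 0.7)}
    print(rank_videos(example, alpha=0.5))
```

Setting `alpha` to 1 ranks by the clip-scale score alone and setting it to 0 ranks by the frame-scale score alone, so the weight controls how much each branch contributes to the final ranking.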
Abstract
In current text-to-video retrieval (T2VR), videos to be retrieved have been properly trimmed so that a correspondence between the videos and ad-hoc textual queries naturally exists. In practice, however, videos circulated on the Internet and social media platforms, while relatively short, are typically rich in content. Often, multiple scenes, actions, or events are shown in a single video, leading to a more challenging T2VR setting in which only part of the video content is relevant to a given query. This paper presents a first study of this setting, which we term Partially Relevant Video Retrieval (PRVR). Considering that a video typically consists of multiple moments, a video is regarded as partially relevant w.r.t. a given query if it contains a query-related moment. We formulate the PRVR task as a multiple instance learning problem, and propose a Multi-Scale Similarity Learning (MS-SL++) network that jointly learns both clip-scale and frame-scale similarities to determine the partial relevance between video-query pairs. Extensive experiments on three diverse video-text datasets (TV show Retrieval, ActivityNet-Captions and Charades-STA) demonstrate the viability of the proposed method.
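The multiple instance learning view can be sketched as follows: a video is treated as a bag of clip embeddings, and the bag is scored by its best-matching clip, so a single query-related moment is enough to make the video relevant. Max-pooling over cosine similarities is the standard multiple instance assumption and is used here only for illustration; the embeddings and function names are hypothetical, and the sketch does not reproduce the MS-SL++ network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bag_similarity(clip_embeddings, query_embedding):
    """Score a video (a bag of clips) by its most query-similar clip."""
    if not clip_embeddings:
        raise ValueError("a video needs at least one clip")
    return max(cosine(clip, query_embedding) for clip in clip_embeddings)


if __name__ == "__main__":
    # Toy 2-D embeddings: only the second clip matches the query.
    video = [[1.0, 0.0], [0.0, 1.0]]
    query = [0.0, 1.0]
    print(bag_similarity(video, query))
```

Averaging the clip similarities instead would dilute the one relevant moment with the unrelated clips, which is why a bag-level maximum fits the partial-relevance setting better than a whole-video average.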
