PRVR: Partially Relevant Video Retrieval

Xianke Chen, Daizong Liu, Xun Yang, Xirong Li, Jianfeng Dong, Meng Wang, Xun Wang

TL;DR

This work introduces Partially Relevant Video Retrieval (PRVR), a realistic setting in which a target video is untrimmed and only some of its moments are relevant to a textual query. It frames PRVR as a multiple instance learning problem and proposes MS-SL++, a dual-branch network that learns clip-scale and frame-scale similarities in a coarse-to-fine manner, augmented by length-aware clip clustering and Key Clip Guided Attention. Training combines triplet ranking and InfoNCE losses at each scale; inference fuses the clip-scale and frame-scale scores as $S(v,q)=\alpha S_c(v,q)+(1-\alpha) S_f(v,q)$. Experiments on TVR, ActivityNet-Captions, and Charades-STA show gains over text-to-video retrieval (T2VR) baselines and indicate that PRVR can serve as a first stage for video corpus moment retrieval (VCMR) and as a supporting step for single video moment retrieval (SVMR), with benefits in both accuracy and efficiency. The work advances realistic video-text retrieval by explicitly modeling partial relevance and multi-scale semantic alignment.
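The inference-time fusion $S(v,q)=\alpha S_c(v,q)+(1-\alpha) S_f(v,q)$ can be illustrated with a minimal sketch. The function names `fuse_scores` and `rank_videos`, the default `alpha=0.7`, and the example scores are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import numpy as np


def fuse_scores(clip_scores, frame_scores, alpha=0.7):
    """Weighted fusion S = alpha * S_c + (1 - alpha) * S_f for one query.

    clip_scores, frame_scores: per-video similarities to the same query.
    alpha: fusion weight in [0, 1]; 0.7 is an illustrative default only.
    """
    clip_scores = np.asarray(clip_scores, dtype=float)
    frame_scores = np.asarray(frame_scores, dtype=float)
    if clip_scores.shape != frame_scores.shape:
        raise ValueError("clip and frame scores must cover the same videos")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * clip_scores + (1.0 - alpha) * frame_scores


def rank_videos(clip_scores, frame_scores, alpha=0.7):
    """Return video indices sorted from most to least similar."""
    fused = fuse_scores(clip_scores, frame_scores, alpha)
    # A stable sort keeps the original order among tied scores.
    return np.argsort(-fused, kind="stable")


if __name__ == "__main__":
    s_c = [0.9, 0.2, 0.5]  # hypothetical clip-scale similarities
    s_f = [0.1, 0.8, 0.5]  # hypothetical frame-scale similarities
    print(rank_videos(s_c, s_f, alpha=0.7))
```

Setting `alpha=1.0` ranks by the clip-scale branch alone and `alpha=0.0` by the frame-scale branch alone, so the weight directly controls how much each scale contributes to the final ranking.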

Abstract

In current text-to-video retrieval (T2VR), videos to be retrieved have been properly trimmed so that a correspondence between the videos and ad-hoc textual queries naturally exists. In practice, however, videos circulated on the Internet and social media platforms, while relatively short, are typically rich in content. Often, multiple scenes, actions, or events are shown in a single video, leading to a more challenging T2VR setting wherein only part of the video content is relevant w.r.t. a given query. This paper presents a first study of this setting, which we term Partially Relevant Video Retrieval (PRVR). Considering that a video typically consists of multiple moments, a video is regarded as partially relevant w.r.t. a given query if it contains a query-related moment. We formulate the PRVR task as a multiple instance learning problem, and propose a Multi-Scale Similarity Learning (MS-SL++) network that jointly learns both clip-scale and frame-scale similarities to determine the partial relevance between video-query pairs. Extensive experiments on three diverse video-text datasets (TV show Retrieval, ActivityNet-Captions and Charades-STA) demonstrate the viability of the proposed method.
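The multiple instance learning view can be sketched as follows: a video is treated as a bag of candidate clips, and the bag is scored by its best-matching clip. The sketch below is a toy illustration under stated assumptions, not the MS-SL++ network: clips are built by mean-pooling contiguous frame windows, similarity is cosine, the bag score is a max over clips, and the names `candidate_clips` and `bag_similarity` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def candidate_clips(frames, window_sizes=(1, 2, 4)):
    """Build the bag: mean-pool every contiguous window of each given size.

    frames: array of shape (num_frames, dim).
    Windows longer than the video are skipped.
    """
    n = len(frames)
    clips = [
        frames[start:start + w].mean(axis=0)
        for w in window_sizes
        for start in range(0, n - w + 1)
    ]
    if not clips:
        raise ValueError("no window size fits the video length")
    return np.stack(clips)


def bag_similarity(frames, query, window_sizes=(1, 2, 4)):
    """Score a video by its most query-similar clip (max over the bag).

    Returns (score, index of the best clip within the bag).
    """
    clips = l2_normalize(candidate_clips(frames, window_sizes))
    sims = clips @ l2_normalize(np.asarray(query, dtype=float))
    key = int(np.argmax(sims))
    return float(sims[key]), key


if __name__ == "__main__":
    # Toy video: only the third frame matches the query direction.
    frames = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
    query = np.array([0.0, 1.0])
    print(bag_similarity(frames, query, window_sizes=(1, 2)))
```

In this toy case the best clip matches the query exactly, whereas mean-pooling the whole video would dilute the relevant frame with irrelevant ones. That dilution is the reason a bag-level score suits partially relevant videos.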

Paper Structure (38 sections, 12 equations, 10 figures, 6 tables)

This paper contains 38 sections, 12 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Difference between current T2VR task and our proposed PRVR task. In T2VR, a target video is typically pre-trimmed and almost fully relevant to the corresponding query, which is too idealized compared to real-world retrieval scenarios. By contrast, a target video in PRVR is untrimmed and diverse with query-irrelevant content, which is regarded as partially relevant to the query.
  • Figure 2: Illustration of the connection of the proposed PRVR task with existing video/moment retrieval tasks. In particular, PRVR can be regarded as a more practical but challenging subtask of T2VR, and serves as a crucial intermediate step for VCMR and SVMR tasks by providing videos partially related to the query.
  • Figure 3: The framework of our proposed model MS-SL++ for PRVR. Given an untrimmed video, the multi-scale video representation module encodes it into both clip-scale and frame-scale features. For more efficient coarse-to-fine similarity computation, a length-aware clustering module is applied to select representative clip-level features. Meanwhile, the simple yet effective sentence representation module encodes the text query into a single feature vector. Then, the coarse relevance between the video and the textual query is obtained by clip-scale similarity learning, during which a key clip is detected. The key clip is then used to guide the aggregation of frame-scale features, which yields the fine-grained relevance between the video and the query. The partial similarity between an untrimmed video and a query is determined jointly by the similarities at both scales.
  • Figure 4: Comparison between length-aware clustering and vanilla clustering in terms of their (a) retrieval performance and (b) the length diversity of the obtained representative clips by clustering.
  • Figure 5: Impact of clip granularity on retrieval performance.
  • ...and 5 more figures