TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries

Renjie Liang; Li Li; Chongzhi Zhang; Jing Wang; Xizhou Zhu; Aixin Sun

TL;DR

This work defines Ranked Video Moment Retrieval (RVMR) to capture realistic moment search, where queries are imprecise and multiple moments may be relevant. It introduces TVR-Ranking, a TVR-derived dataset with 3,281 queries and 94,442 annotated query–moment relevance judgments, plus a new evaluation metric, $NDCG@K$ with $IoU\geq \mu$, that assesses ranking quality and localization together. The authors adapt three VCMR baselines (XML, CONQUER, ReLoCLNet) with query–moment similarity weighting and show that effective ranking for RVMR differs from traditional VCMR, with ReLoCLNet yielding the strongest results given sufficient pseudo-training. The dataset and metric enable exploration of multi-modal ranking and retrieval strategies for video moments and highlight the gap between existing VCMR methods and practical search needs. The work thus provides benchmarks and baselines to guide the development of ranking-aware, cross-modal video moment retrieval systems.

Abstract

In this paper, we propose the task of \textit{Ranked Video Moment Retrieval} (RVMR): locating a ranked list of matching moments from a collection of videos through queries in natural language. Although a few related tasks have been proposed and studied by the CV, NLP, and IR communities, RVMR is the task that best reflects the practical setting of moment search. To facilitate research in RVMR, we develop the TVR-Ranking dataset, based on the raw videos and existing moment annotations provided in the TVR dataset. Our key contribution is the manual annotation of relevance levels for 94,442 query-moment pairs. We then develop the $NDCG@K$, $IoU\geq \mu$ evaluation metric for this new task and conduct experiments to evaluate three baseline models. Our experiments show that the new RVMR task brings new challenges to existing models, and we believe this new dataset contributes to the research on multi-modality search. The dataset is available at \url{https://github.com/Ranking-VMR/TVR-Ranking}.

Paper Structure (21 sections, 1 equation, 7 figures, 8 tables)

Introduction
Related work
The TVR-Ranking Dataset Annotation
Imprecise Queries
Relevance Annotation and Quality Control
Pseudo Training Set Generation
TVR-Ranking: Statistics
Evaluation Metric for RVMR
Baseline Performance
Conclusion and Limitations
Character Name Substitution
Annotation Guideline, Annotator, and Annotation Analysis
Annotation Guideline
Annotation Setup and Annotators
Annotation Analysis: Relevance Level vs Moment-Caption Similarity
...and 6 more sections

Figures (7)

Figure 1: RVMR and its related tasks. A rectangle represents a video; the moment matching text query $Q$ is shaded. In VR, VMR, and VCMR, exactly one video/moment is to be retrieved.
Figure 2: Illustration of (a) IoU, and (b)--(d) matching for $NDCG@3$, $\mu=0.3$. (b) $p_1$ matches $g_3$, with $rel=2$, because it has the larger $IoU$, above the 0.3 threshold. (c) $p_2$ matches $g_1$, as $g_3$ is no longer available. (d) $p_3$ matches $g_4$, with $rel=2$.
Figure 3: An example prompt and the subsequent conversation with ChatGPT for replacing character names.
Figure 4: Relationship between relevance score (1 - 4) and $sim(q, m.c)$ with annotations for 10 sample queries. The candidate moments of the same query are in one color. In (b), moment ranking positions are by their similarity scores $sim(q, m.c)$ in descending order.
Figure 5: (a) Distribution of the relevance scores in all raw annotations by two or four annotators. (b) Distribution of the final scores after discarding annotations in disagreement.
...and 2 more figures
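
The matching procedure described in the Figure 2 caption can be sketched in Python. This is a minimal illustration, not the authors' evaluation script. The function names, the `(video_id, start, end)` tuple layout, and the exponential gain $2^{rel}-1$ with a $\log_2$ discount are assumptions, since the paper's exact equation is not reproduced on this page. Each of the top-$K$ predictions is matched to the still-available ground-truth moment with the largest IoU at or above $\mu$, and a matched ground truth cannot be reused.

```python
import math


def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def dcg(gains):
    # Assumed gain/discount: (2^rel - 1) / log2(rank + 1), ranks starting at 1.
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k_iou(preds, gts, k, mu):
    """NDCG@K with an IoU >= mu matching threshold (illustrative sketch).

    preds: ranked list of (video_id, start, end)
    gts:   list of (video_id, start, end, rel) for one query
    """
    available = list(gts)  # ground truths not yet matched
    gains = []
    for vid, start, end in preds[:k]:
        best_idx, best_iou = None, 0.0
        for idx, (g_vid, g_start, g_end, _rel) in enumerate(available):
            if g_vid != vid:
                continue
            iou = temporal_iou((start, end), (g_start, g_end))
            if iou >= mu and iou > best_iou:
                best_idx, best_iou = idx, iou
        if best_idx is None:
            gains.append(0)  # no ground truth overlaps enough
        else:
            # Consume the matched ground truth so it cannot be reused.
            gains.append(available.pop(best_idx)[3])
    ideal = sorted((g[3] for g in gts), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0
```

Because a matched ground truth is removed from the pool, a duplicate prediction of an already-retrieved moment contributes zero gain, mirroring panel (c), where $g_3$ is no longer available to $p_2$.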
