MVMR: A New Framework for Evaluating Faithfulness of Video Moment Retrieval against Multiple Distractors

Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung

TL;DR

This work addresses the faithfulness of Video Moment Retrieval (VMR) under misinformation by introducing Massive Videos Moment Retrieval (MVMR), a task that retrieves the moments matching a query from a massive video pool containing multiple distractors. It proposes an automated dataset construction framework that uses semantic distance verification (textual with SimCSE, visual with EMScore) to create three practical MVMR benchmarks, with human validation showing low mislabel rates. To counter misinformation, CroCs, a cross-directional informative sample-weighted contrastive learning method built on MMN, combines weakly-supervised potential negative learning and hard-negative sampling to robustly distinguish positives from negatives. Experiments reveal that standard VMR models are vulnerable to misinformation in MVMR, whereas CroCs significantly improves faithfulness and robustness, even when integrated into realistic video-retrieval pipelines. Code and datasets are publicly released to foster further research in trustworthy video-language retrieval systems.

Abstract

With the explosion of multimedia content, video moment retrieval (VMR), which aims to detect the video moment that matches a given text query within a video, has been studied intensively as a critical problem. However, the existing VMR framework evaluates retrieval performance under the assumption that the relevant video is given, which may not reveal whether models are overconfident when the given video is wrong. In this paper, we propose the MVMR (Massive Videos Moment Retrieval for Faithfulness Evaluation) task, which aims to retrieve video moments from a massive video set that includes multiple distractors, in order to evaluate the faithfulness of VMR models. For this task, we suggest an automated massive video pool construction framework that categorizes videos into negative (distractor) and positive (false-negative) sets using textual and visual semantic distance verification methods. We extend existing VMR datasets using these methods and construct three new practical MVMR datasets. To solve the task, we further propose a strong informative sample-weighted learning method, CroCs, which employs two contrastive learning mechanisms: (1) weakly-supervised potential negative learning and (2) cross-directional hard-negative learning. Experimental results on the MVMR datasets reveal that existing VMR models are easily distracted by misinformation (distractors), whereas our model is significantly more robust, demonstrating that CroCs is essential to distinguishing positive moments from distractors. Our code and datasets are publicly available: https://github.com/yny0506/Massive-Videos-Moment-Retrieval.
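
As a rough illustration of the pool construction idea in the abstract, the following minimal sketch splits candidate videos into a positive (possible false-negative) set and a negative (distractor) set. It assumes the textual and visual similarity scores have already been computed (for example with SimCSE and EMScore). The function name `split_video_pool`, the threshold values, and the rule that either score above its threshold marks a video as positive are hypothetical choices for illustration, not the authors' exact procedure.

```python
from typing import Dict, List, Tuple


def split_video_pool(
    text_sims: Dict[str, List[float]],
    visual_scores: Dict[str, float],
    text_threshold: float = 0.5,    # hypothetical value
    visual_threshold: float = 0.3,  # hypothetical value
) -> Tuple[List[str], List[str]]:
    """Split candidate videos into (positives, negatives) for one target query.

    text_sims maps each candidate video to the similarities between the
    target query and that video's own queries (precomputed).
    visual_scores maps each candidate video to a query-video similarity
    score (precomputed).
    """
    positives: List[str] = []
    negatives: List[str] = []
    for video_id, sims in text_sims.items():
        # Max aggregation: the most similar query decides the text score.
        max_text_sim = max(sims) if sims else 0.0
        visual = visual_scores.get(video_id, 0.0)
        # Assumption: a video counts as a distractor only if BOTH checks
        # say it is semantically far from the target query.
        if max_text_sim >= text_threshold or visual >= visual_threshold:
            positives.append(video_id)
        else:
            negatives.append(video_id)
    return positives, negatives


if __name__ == "__main__":
    text_sims = {"a": [0.1, 0.9], "b": [0.1, 0.2], "c": [0.2]}
    visual_scores = {"a": 0.1, "b": 0.1, "c": 0.8}
    print(split_video_pool(text_sims, visual_scores))
```

Treating borderline videos as positives rather than negatives is the conservative choice here: it keeps likely false negatives out of the distractor set, at the cost of a smaller pool.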

Paper Structure (40 sections, 13 equations, 8 figures, 3 tables)

This paper contains 40 sections, 13 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 2: Massive Video Pool Construction. In our experiment on the TACoS dataset, when a massive video set is constructed by random sampling, over 40% of queries include at least one false-negative video, which shows the risk of random sampling. Our method constructs a massive video set by filtering reliable positive and negative sets with semantic distance verification methods, accounting for possible false negatives. $v^{+}_{i}$ and $v^{-}_{j}$ denote a positive and a negative video, respectively.
  • Figure 3: Examples of constructed MVMR datasets. We visualize the positive and negative video sets for one query from each of the three constructed MVMR datasets. A green solid box marks a golden positive moment, and blue solid boxes mark moments assigned to videos classified as positive. Underlined queries are the most similar queries under max aggregation, as described in the paper's semantic check section.
  • Figure 4: Qualitative Analysis for Filtering Methods. We visualize the similarity scores derived from SimCSE and EMScore to verify the constructed MVMR datasets. For the SimCSE similarity distribution, we use t-SNE to reduce the dimension of each query embedding, and each dot represents one query.
  • Figure 5: CroCs Overview. We adopt informative sample-weighted mutual matching learning to solve the MVMR task. The dots and triangles are the features of moments and texts, respectively. The blue dashed line marks matched moment-text pairs to be pulled together, while the red dashed lines mark intra-/inter-video negative samples to be pushed apart. The yellow and orange dashed lines mark unmatched moment-text pairs that are filtered out of training because they are easy negatives or false negatives.
  • Figure 6: VMR vs. MVMR Performance. We report the average scores of R1@0.5 and R5@0.5 to reveal the vulnerability of VMR models to misinformation. The x- and y-axes correspond to the models and the average rank score, respectively.
  • ...and 3 more figures
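
The R1@0.5 and R5@0.5 scores mentioned in the Figure 6 caption follow the usual recall-at-k convention with a temporal IoU threshold: a query counts as a hit if any of its top-k predicted moments overlaps the ground-truth moment with IoU of at least 0.5. The sketch below is a minimal illustration, not the authors' evaluation code. The `(video_id, start, end)` moment format and the rule that a prediction from a different video is always a miss are simplifying assumptions made here to reflect the multi-video setting.

```python
from typing import Sequence, Tuple

# Hypothetical moment format: (video_id, start_time, end_time).
Moment = Tuple[str, float, float]


def temporal_iou(a: Moment, b: Moment) -> float:
    """Temporal IoU of two moments; 0.0 if they come from different videos."""
    if a[0] != b[0]:
        return 0.0
    inter = max(0.0, min(a[2], b[2]) - max(a[1], b[1]))
    union = (a[2] - a[1]) + (b[2] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_predictions: Sequence[Sequence[Moment]],
    ground_truths: Sequence[Moment],
    k: int,
    iou_threshold: float = 0.5,
) -> float:
    """Fraction of queries with at least one top-k prediction above the IoU threshold."""
    if not ground_truths:
        return 0.0
    hits = 0
    for preds, gt in zip(ranked_predictions, ground_truths):
        # Only the k highest-ranked moments are considered for each query.
        if any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k]):
            hits += 1
    return hits / len(ground_truths)


if __name__ == "__main__":
    predictions = [
        [("v1", 0.0, 10.0), ("v2", 0.0, 10.0)],
        [("v3", 0.0, 4.0), ("v1", 5.0, 10.0)],
    ]
    truths = [("v2", 0.0, 10.0), ("v1", 4.0, 10.0)]
    print(recall_at_k(predictions, truths, k=1))
    print(recall_at_k(predictions, truths, k=2))
```

In this toy input the top-ranked moment for each query comes from the wrong video, which is exactly the failure mode that distractors are meant to expose.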