Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä
TL;DR
This paper reveals that biases in standard video moment retrieval benchmarks can drive model performance more than actual video understanding. It introduces blind baselines and sanity checks to quantify reliance on priors, showing that many models perform comparably or better without using visual input, especially on Charades-STA, while highlighting human annotator disagreement. The authors propose alternative evaluation metrics and dataset augmentation strategies to mitigate biases and better reflect true temporal grounding capability. Overall, the work calls for more robust baselines, diverse data, and fairer evaluation to accurately measure progress in temporal sentence grounding.
Abstract
Query-based moment retrieval is the problem of localising a specific clip in an untrimmed video according to a query sentence. This is a challenging task that requires interpreting both the natural language query and the video content. As in many other areas of computer vision and machine learning, progress in query-based moment retrieval is heavily driven by benchmark datasets, and their quality therefore has a significant impact on the field. In this paper, we present a series of experiments assessing how well benchmark results reflect true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions for improving temporal sentence grounding in the future. Our code for this paper is available at https://mayu-ot.github.io/hidden-challenges-MR.
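To make the idea of a blind baseline concrete, the following is a minimal sketch and not the authors' implementation. It assumes moments are given as (start, end) pairs, that training moments are normalised by video duration to [0, 1], and that the prior is summarised by the single most frequent bin pair. The function names `temporal_iou`, `fit_blind_prior`, and `recall_at_1` are hypothetical and chosen for illustration only. The baseline never looks at the video or the query, so any score it achieves reflects dataset priors alone.

```python
from collections import Counter


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def fit_blind_prior(train_moments, bins=10):
    """Return the most frequent normalised (start, end) bin pair.

    Assumption: train_moments holds (start, end) pairs already divided
    by the video duration, so every value lies in [0, 1].
    """
    counts = Counter()
    for start, end in train_moments:
        s = min(int(start * bins), bins - 1)
        e = min(int(end * bins), bins - 1)
        counts[(s, e)] += 1
    (s, e), _ = counts.most_common(1)[0]
    # Predict from the left edge of the start bin to the right edge of the end bin.
    return s / bins, (e + 1) / bins


def recall_at_1(preds, gts, threshold=0.5):
    """Fraction of top-1 predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    train = [(0.0, 0.25), (0.02, 0.28), (0.5, 0.95)]
    prior = fit_blind_prior(train)

    # Test set: (video duration in seconds, ground-truth moment in seconds).
    test = [(30.0, (0.0, 8.0)), (30.0, (20.0, 30.0))]
    preds = [(prior[0] * dur, prior[1] * dur) for dur, _ in test]
    gts = [gt for _, gt in test]
    print(recall_at_1(preds, gts, threshold=0.5))
```

Comparing such a prior-only score against the score of a full model on the same split gives a rough indication of how much of the reported performance could be explained by annotation biases rather than by video understanding.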
