Uncovering Hidden Challenges in Query-Based Video Moment Retrieval

Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä

TL;DR

This paper reveals that biases in standard video moment retrieval benchmarks can drive model performance more than actual video understanding. It introduces blind baselines and sanity checks to quantify reliance on priors, showing that many models perform comparably or better without using visual input, especially on Charades-STA, while highlighting human annotator disagreement. The authors propose alternative evaluation metrics and dataset augmentation strategies to mitigate biases and better reflect true temporal grounding capability. Overall, the work calls for more robust baselines, diverse data, and fairer evaluation to accurately measure progress in temporal sentence grounding.

Abstract

Query-based moment retrieval is the problem of localising a specific clip in an untrimmed video according to a query sentence. This is a challenging task that requires interpretation of both the natural language query and the video content. As in many other areas of computer vision and machine learning, progress in query-based moment retrieval is heavily driven by benchmark datasets, so their quality has a significant impact on the field. In this paper, we present a series of experiments assessing how well the benchmark results reflect true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions for improving temporal sentence grounding in the future. Our code for this paper is available at https://mayu-ot.github.io/hidden-challenges-MR.

Paper Structure

This paper contains 14 sections, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Moment retrieval finds the moment in a video corresponding to a query sentence.
  • Figure 2: Top-30 frequent actions of each dataset.
  • Figure 3: Distributions of temporal locations of target moments. Color represents the value of the probability density function. The top three plots are the distributions for Charades-STA, and the bottom three are for ActivityNet Captions. For each dataset, the left distribution is produced from all moments, while the other two are produced from moments described by particular verbs. More examples can be found in the supplementary material.
  • Figure 4: R@1 (IoU > 0.5) scores on Charades-STA (left) and ActivityNet Captions (right). Highlighted bars indicate blind baselines. Surprisingly, the blind baselines outperform many deep models and reach close to the state-of-the-art on ActivityNet Captions.
  • Figure 5: R@1 (IoU > 0.5) scores for 2D-TAN [2DTAN_2020_AAAI] and SCDM [yuan2019semantic] when the original input videos and randomized ones are fed into these models.
  • ...and 8 more figures
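
The R@1 (IoU > 0.5) score reported in Figures 4 and 5 can be illustrated with a minimal sketch. It assumes each moment is a (start, end) pair in seconds and that every query has exactly one top-ranked prediction and one ground-truth moment. The names `temporal_iou` and `recall_at_1` are hypothetical and are not taken from the authors' released code.

```python
from typing import List, Tuple

# Assumption: a moment is a (start, end) interval in seconds.
Moment = Tuple[float, float]


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds: List[Moment], gts: List[Moment],
                threshold: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction has IoU above the threshold."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")
    hits = sum(temporal_iou(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Illustrative values only, not data from the paper.
ground_truth = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
predictions = [(0.0, 8.0), (0.0, 10.0), (21.0, 29.0)]
score = recall_at_1(predictions, ground_truth)  # two of three queries are hits
```

Because the score depends only on the predicted and ground-truth intervals, a predictor that never looks at the video can be evaluated with the same function, which is what makes a comparison against blind baselines possible.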