Selective Query-guided Debiasing for Video Corpus Moment Retrieval

Sunjae Yoon; Ji Woo Hong; Eunseop Yoon; Dahyun Kim; Junyeong Kim; Hee Suk Yoon; Chang D. Yoo

TL;DR

The paper tackles retrieval bias in video corpus moment retrieval by proposing SQuiDNet, a selective debiasing framework. It jointly trains Naive Moment Retrieval (NMR) for accurate query-video alignment and Biased Moment Retrieval (BMR) to uncover bias from object words, then uses Selective Query-guided Debiasing (SQuiD) to decide when to leverage or counteract this bias based on the query meaning. Central contributions include the Co-occurrence Table and Learnable Confounder to guide bias usage, plus a shared Modality Matching Attention backbone for cross-modal fusion. Experiments on TVR, ActivityNet, and DiDeMo show state-of-the-art results and strong ablations, with qualitative analyses highlighting improved interpretability. The approach offers practical benefits for robust, explainable VCMR in large video corpora.

Abstract

Video moment retrieval (VMR) aims to localize target moments in untrimmed videos pertinent to a given textual query. Existing retrieval systems tend to rely on retrieval bias as a shortcut and thus fail to sufficiently learn multi-modal interactions between query and video. This retrieval bias stems from learning frequent co-occurrence patterns between queries and moments, which spuriously correlate objects (e.g., a pencil) referred to in the query with moments (e.g., a scene of writing with a pencil) where those objects frequently appear in the video, so that the predictions converge to biased moments. Although recent debiasing methods have focused on removing this retrieval bias, we argue that these biased predictions should sometimes be preserved, because there are many queries for which biased predictions are in fact helpful. To exploit this retrieval bias selectively, we propose a Selective Query-guided Debiasing network (SQuiDNet), which incorporates the following two main properties: (1) Biased Moment Retrieval, which intentionally uncovers the biased moments inherent in the objects of the query, and (2) Selective Query-guided Debiasing, which performs selective debiasing guided by the meaning of the query. Our experimental results on three moment retrieval benchmarks (i.e., TVR, ActivityNet, DiDeMo) show the effectiveness of SQuiDNet, and qualitative analysis shows improved interpretability.
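To make the notion of object-predicate co-occurrence and of "helpful" versus "harmful" bias more concrete, the following is a minimal sketch and not the authors' implementation. It assumes queries have already been reduced to (object, predicate) pairs, and it treats a bias as helpful when the query's predicate is among the most frequent predicates for that object. The function names, the toy data, and the `top_k` decision rule are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_cooccurrence_table(pairs):
    """Count how often each predicate co-occurs with each object.

    `pairs` is an iterable of (object, predicate) tuples; extracting such
    pairs from real queries is assumed to happen elsewhere.
    """
    table = defaultdict(Counter)
    for obj, predicate in pairs:
        table[obj][predicate] += 1
    return table


def bias_is_helpful(table, obj, predicate, top_k=1):
    """Hypothetical rule: the bias helps if `predicate` is one of the
    `top_k` most frequent predicates observed with `obj`."""
    counts = table.get(obj, Counter())
    frequent = [p for p, _ in counts.most_common(top_k)]
    return predicate in frequent


# Toy annotations (invented for illustration, not taken from any dataset).
toy_pairs = [
    ("television", "watch"), ("television", "watch"), ("television", "watch"),
    ("television", "turn off"), ("television", "fix"),
    ("chair", "sit"), ("chair", "sit"), ("chair", "kick"),
]
table = build_cooccurrence_table(toy_pairs)

# A query about watching television agrees with the dominant pattern,
# so the biased prediction would be kept under this rule.
helpful = bias_is_helpful(table, "television", "watch")

# A query about fixing the television departs from the dominant pattern,
# so the biased prediction would be counteracted under this rule.
harmful = not bias_is_helpful(table, "television", "fix")
```

Under this toy rule, a query such as "watch television" keeps the biased prediction, while "fix television" triggers debiasing. How SQuiDNet actually makes this decision is described in the paper's method sections and is not reproduced here.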

Paper Structure

This paper contains 28 sections, 12 equations, 6 figures, and 3 tables.

Introduction
Related Work
Video Moment Retrieval
Causal Reasoning in Vision-Language
Method
Selective Query-guided Debiasing Network
Input Representations
Video Representation.
Text Representation.
Modality Matching Attention
Biased Moment Retrieval and Naive Moment Retrieval
Video-subtitle Matching
Video-query Matching
Conditional Moment Prediction
Selective Query-guided Debiasing
...and 13 more sections

Figures (6)

Figure 1: VCMR training and inference. Biased annotations in the training dataset induce retrieval bias, which causes biased moment predictions at inference.
Figure 2: (a) Joint plot of tIoU scores for all predictions of the biased model and the current model, showing the correlation between the two models; (b) object ('television')-predicate co-occurrence distribution over all queries, showing a predominant predicate word ('watch'); (c) example queries where the retrieval bias ('television'-'scene of watching television') serves as a 'good bias' or a 'bad bias', based on the statistics in (b).
Figure 3: SQuiDNet is composed of 3 modules: (a) BMR which reveals biased retrieval, (b) NMR which performs accurate retrieval, (c) SQuiD which removes bad biases from accurate retrieval of NMR subject to the meaning of query.
Figure 4: Accuracy as a function of the top-k objects used for the Co-occurrence Table and of varying k for the Learnable Confounder.
Figure 5: Visualization of word-level query-video similarities in the GT moment. The upper box shows results from SQuiDNet trained without BMR and the lower box shows results from SQuiDNet with BMR. BMR enables the network to learn the uncommon predicate "kicks" of the object "chair" while also strengthening the learning of the spuriously correlated predicate "sits."
...and 1 more figures
