Selective Query-guided Debiasing for Video Corpus Moment Retrieval
Sunjae Yoon, Ji Woo Hong, Eunseop Yoon, Dahyun Kim, Junyeong Kim, Hee Suk Yoon, Chang D. Yoo
TL;DR
The paper tackles retrieval bias in video corpus moment retrieval (VCMR) by proposing SQuiDNet, a selective debiasing framework. It jointly trains Naive Moment Retrieval (NMR) for accurate query-video alignment and Biased Moment Retrieval (BMR) to uncover bias from object words, then uses Selective Query-guided Debiasing (SQuiD) to decide, based on the meaning of the query, when to leverage this bias and when to counteract it. Central contributions include the Co-occurrence Table and Learnable Confounder to guide bias usage, plus a shared Modality Matching Attention backbone for cross-modal fusion. Experiments on TVR, ActivityNet, and DiDeMo show state-of-the-art results, supported by ablation studies, and qualitative analyses highlight improved interpretability. The approach is aimed at robust, explainable VCMR in large video corpora.
Abstract
Video moment retrieval (VMR) aims to localize target moments in untrimmed videos pertinent to a given textual query. Existing retrieval systems tend to rely on retrieval bias as a shortcut and thus fail to sufficiently learn multi-modal interactions between query and video. This retrieval bias stems from learning frequent co-occurrence patterns between query and moments, which spuriously correlate objects (e.g., a pencil) referred to in the query with moments (e.g., a scene of writing with a pencil) where the objects frequently appear in the video, so that the predictions converge on biased moments. Although recent debiasing methods have focused on removing this retrieval bias, we argue that these biased predictions should sometimes be preserved, because there are many queries for which biased predictions are actually helpful. To make selective use of this retrieval bias, we propose a Selective Query-guided Debiasing network (SQuiDNet), which incorporates the following two main properties: (1) Biased Moment Retrieval, which intentionally uncovers the biased moments inherent in objects of the query, and (2) Selective Query-guided Debiasing, which performs selective debiasing guided by the meaning of the query. Our experimental results on three moment retrieval benchmarks (i.e., TVR, ActivityNet, DiDeMo) show the effectiveness of SQuiDNet, and qualitative analysis shows improved interpretability.
