Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval

Haojian Huang, Kaijing Ma, Jin Chen, Haodong Chen, Zhou Wu, Xianghao Zang, Han Fang, Chao Ban, Hao Sun, Mulin Chen, Zhongjiang He

TL;DR

The paper tackles the instability of moment retrieval under uncertain and ambiguous moments by extending Deep Evidential Regression to MR and addressing multimodal bias with a Reflective Flipped Fusion block, a query reconstruction task, and a Geom-Regularizer. The Debiased Evidential Learning for Moment Retrieval (DEMR) framework yields improved uncertainty calibration, cross-modal alignment, and retrieval accuracy on debiased and standard MR benchmarks. Comprehensive experiments and ablations demonstrate the effectiveness, robustness, and interpretability of the approach, with strong results against state-of-the-art methods and clear evidence of reduced modality bias. The work advances trustworthy, uncertainty-aware MR and sets a foundation for integration with larger multimodal models in video understanding tasks.

Abstract

In the domain of moment retrieval, accurately identifying temporal segments within videos based on natural language queries remains challenging. Traditional methods often employ pre-trained models that struggle with fine-grained information and deterministic reasoning, leading to difficulties in aligning with complex or ambiguous moments. To overcome these limitations, we explore Deep Evidential Regression (DER) to construct a vanilla Evidential baseline. However, this approach encounters two major issues: the inability to effectively handle modality imbalance and the structural differences in DER's heuristic uncertainty regularizer, which adversely affect uncertainty estimation. This misalignment results in high uncertainty being incorrectly associated with accurate samples rather than challenging ones. Our observations indicate that existing methods lack the adaptability required for complex video scenarios. In response, we propose Debiased Evidential Learning for Moment Retrieval (DEMR), a novel framework that incorporates a Reflective Flipped Fusion (RFF) block for cross-modal alignment and a query reconstruction task to enhance text sensitivity, thereby reducing bias in uncertainty estimation. Additionally, we introduce a Geom-regularizer to refine uncertainty predictions, enabling adaptive alignment with difficult moments and improving retrieval accuracy. Extensive testing on standard datasets and debiased datasets ActivityNet-CD and Charades-CD demonstrates significant enhancements in effectiveness, robustness, and interpretability, positioning our approach as a promising solution for temporal-semantic robustness in moment retrieval. The code is publicly available at https://github.com/KaijingOfficial/DEMR.

Paper Structure

This paper contains 12 sections, 15 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Comparison of MR methods: (a) Deterministic methods (e.g., lin2023univtg) are overconfident with limited evidence, using Non-Maximum Suppression (NMS) but still failing on challenging frames; (b) Vanilla evidential methods consider uncertainty but produce biased estimates on hard samples; (c) Our method adaptively aligns with challenging semantics for accurate uncertainty modeling and improved inference. Yellow regions denote uncertainty predictions.
  • Figure 2: Comparison of the baseline (a) and our improved model (b) for the MR task. In (a), the baseline exhibits weak sensitivity to text, as the overlap between the MR task and DER objective causes over-reliance on visual features, while the vanilla DER regularizer leads to unreliable uncertainty estimates. In (b), our RFF block and QR head enhance cross-modal interaction and text sensitivity, and the Geom-regularizer corrects structural flaws in DER for more reliable uncertainty estimation.
  • Figure 3: Gradient field comparison. (a) Vanilla regularizer applies penalties based solely on error, decreasing evidence as error increases. (b) Our Geom-regularizer modulates penalties dynamically based on error magnitude and evidence levels. Our approach reflects the principle that accurate predictions should have higher evidence, while evidence should be suppressed for less accurate predictions.
  • Figure 4: Parameter analysis on the QVHighlights val split, measured by the change in MAP. (a) Effect of different weights for our proposed Geom-regularizer (left) and the DER loss (right). (b) Impact of the query reconstruction task at different epochs (left) and learning rates (right).
  • Figure 5: Uncertainty KDE over different noise levels. Using Gaussian kernel density estimation (KDE), we plotted the uncertainty distribution for the QVHighlights val set.
  • ...and 4 more figures
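
The behaviour attributed to the vanilla regularizer in Figure 3(a) can be illustrated with a minimal sketch. It assumes the standard Deep Evidential Regression formulation, in which the network outputs Normal-Inverse-Gamma parameters (gamma, nu, alpha, beta) and the regularizer is the absolute error scaled by the total evidence 2*nu + alpha. The function names are hypothetical and are not taken from the DEMR code base. The Geom-regularizer is not reproduced, because its definition does not appear in this section.

```python
def vanilla_regularizer(y, gamma, nu, alpha):
    """Vanilla DER regularizer (assumed standard form):
    absolute error scaled by total evidence (2*nu + alpha)."""
    return abs(y - gamma) * (2.0 * nu + alpha)


def vanilla_regularizer_grad(y, gamma):
    """Analytic gradient of the regularizer w.r.t. (nu, alpha).

    Both components depend only on the error |y - gamma| and not on the
    current evidence, which is the property shown in Figure 3(a).
    """
    err = abs(y - gamma)
    return 2.0 * err, err


def epistemic_uncertainty(nu, alpha, beta):
    """Epistemic uncertainty of the NIG distribution (requires alpha > 1)."""
    return beta / (nu * (alpha - 1.0))


if __name__ == "__main__":
    # Same error, very different evidence levels: the penalty gradient is identical.
    y, gamma = 1.0, 0.5
    for nu, alpha in [(0.1, 1.5), (10.0, 20.0)]:
        print(nu, alpha, vanilla_regularizer(y, gamma, nu, alpha),
              vanilla_regularizer_grad(y, gamma))
```

Because the gradient ignores the current evidence, a low-evidence and a high-evidence prediction with the same error are pushed by the same amount. Figure 3(b) describes the Geom-regularizer as modulating the penalty by both error magnitude and evidence level.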