Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding
Kaijing Ma, Haojian Huang, Jin Chen, Haodong Chen, Pengliang Ji, Xianghao Zang, Han Fang, Chao Ban, Hao Sun, Mulin Chen, Xuelong Li
TL;DR
This work addresses open-world Video Temporal Grounding (VTG) by integrating Evidential Deep Learning (EDL) into a robust VTG framework. The proposed SRAM module employs a two-stage cross-modal alignment with Reflective Flipped Fusion (RFF) blocks and a masked-language-modeling pretraining regime, while a DER-based evidential head quantifies both aleatoric and epistemic uncertainty. To overcome regularizer flaws in vanilla DER, the authors introduce a Geom-regularizer (Type I and II lines) that adaptively calibrates evidence against prediction error, yielding more trustworthy uncertainty estimates. Extensive experiments across multiple VTG benchmarks demonstrate strong grounding performance and improved uncertainty calibration, including robustness to open-world and adversarial conditions. The approach advances reliable, uncertainty-aware VTG with potential impact on video-language systems and human-computer interaction in noisy or out-of-distribution contexts.
Abstract
Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos. This leads to unreliable predictions for noisy, corrupted, and out-of-distribution data. Adapting VTG models to dynamically estimate uncertainties based on user input can address this issue. To this end, we introduce SRAM, a robust network module that benefits from a two-stage cross-modal alignment task. More importantly, it integrates Deep Evidential Regression (DER) to explicitly and thoroughly quantify uncertainty during training, thus allowing the model to say "I do not know" in scenarios beyond its handling capacity. However, directly applying traditional DER theory and its regularizer reveals structural flaws that lead to unintended constraints in VTG tasks. In response, we develop a simple yet effective Geom-regularizer that enhances the uncertainty learning framework from the ground up. To the best of our knowledge, this marks the first successful application of DER to VTG. Our extensive quantitative and qualitative results affirm the effectiveness, robustness, and interpretability of our modules and the uncertainty learning paradigm in VTG tasks. The code will be made available.
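
To make the DER idea above concrete, the following is a minimal sketch of how an evidential regression head can separate aleatoric from epistemic uncertainty. It assumes the standard DER formulation, in which the head outputs the four parameters (gamma, nu, alpha, beta) of a Normal-Inverse-Gamma distribution over a scalar target such as a moment boundary. The names `evidential_head`, `uncertainties`, and `EPS` are hypothetical and chosen for illustration; this is not the SRAM implementation, and the Geom-regularizer is not shown because its form is not given here.

```python
import math
from dataclasses import dataclass

EPS = 1e-6  # assumption: small floor to keep the parameters strictly positive


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


@dataclass
class NIG:
    gamma: float  # predicted value (e.g. a normalized timestamp)
    nu: float     # evidence for the mean, > 0
    alpha: float  # > 1 so that the expected variance is finite
    beta: float   # > 0


def evidential_head(raw):
    """Map four unconstrained network outputs to valid NIG parameters."""
    g, v, a, b = raw
    return NIG(
        gamma=g,
        nu=softplus(v) + EPS,
        alpha=softplus(a) + 1.0 + EPS,
        beta=softplus(b) + EPS,
    )


def uncertainties(p: NIG):
    """Return (aleatoric, epistemic) under the standard DER formulation.

    aleatoric = E[sigma^2] = beta / (alpha - 1)
    epistemic = Var[mu]    = beta / (nu * (alpha - 1))
    """
    aleatoric = p.beta / (p.alpha - 1.0)
    epistemic = aleatoric / p.nu
    return aleatoric, epistemic


# Two hypothetical predictions: same noise level, different amounts of evidence.
low_evidence = evidential_head((0.30, -3.0, 0.0, 0.0))
high_evidence = evidential_head((0.30, 3.0, 0.0, 0.0))
```

In this sketch, a prediction backed by little evidence (small nu) receives a larger epistemic uncertainty than one backed by more evidence, even though both share the same aleatoric term. A threshold on the epistemic value is one simple way such a model could decline to answer, in the spirit of the "I do not know" behavior described above.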
