Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding

Kaijing Ma, Haojian Huang, Jin Chen, Haodong Chen, Pengliang Ji, Xianghao Zang, Han Fang, Chao Ban, Hao Sun, Mulin Chen, Xuelong Li

TL;DR

This work addresses open-world Video Temporal Grounding (VTG) by integrating Evidential Deep Learning (EDL) into a robust VTG framework. The proposed SRAM module employs a two-stage cross-modal alignment with Reflective Flipped Fusion (RFF) blocks and a masked-language-modeling pretraining regime, while a DER-based evidential head quantifies both aleatoric and epistemic uncertainty. To overcome regularizer flaws in vanilla DER, the authors introduce Geom-regularization (Type I and II lines) that adaptively calibrates evidence with prediction error, yielding more trustworthy uncertainty estimates. Extensive experiments across multiple VTG benchmarks demonstrate strong grounding performance and improved uncertainty calibration, including robustness to open-world and adversarial conditions. The approach advances reliable, uncertainty-aware VTG with potential impact on video-language systems and human–computer interactions in noisy or out-of-distribution contexts.

Abstract

Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos. This leads to unreliable predictions for noisy, corrupted, and out-of-distribution data. Adapting VTG models to dynamically estimate uncertainties based on user input can address this issue. To this end, we introduce SRAM, a robust network module that benefits from a two-stage cross-modal alignment task. More importantly, it integrates Deep Evidential Regression (DER) to explicitly and thoroughly quantify uncertainty during training, thus allowing the model to say "I do not know" in scenarios beyond its handling capacity. However, the direct application of traditional DER theory and its regularizer reveals structural flaws, leading to unintended constraints in VTG tasks. In response, we develop a simple yet effective Geom-regularizer that enhances the uncertainty learning framework from the ground up. To the best of our knowledge, this marks the first successful attempt of DER in VTG. Our extensive quantitative and qualitative results affirm the effectiveness, robustness, and interpretability of our modules and the uncertainty learning paradigm in VTG tasks. The code will be made available.
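
To make the abstract's reference to Deep Evidential Regression concrete, the following is a minimal sketch of the standard DER formulation (a Normal-Inverse-Gamma evidential head, as in the original DER literature), not the authors' SRAM code. The function names and the epsilon value are hypothetical choices for illustration. The Geom-regularizer is not reproduced, because this section does not give its form; only the vanilla regularizer that the abstract describes as flawed is shown.

```python
import math

EPS = 1e-6  # assumed small constant that keeps parameters strictly valid


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def nig_parameters(raw):
    """Map four unconstrained head outputs to Normal-Inverse-Gamma parameters.

    Constraints in standard DER: nu > 0, alpha > 1, beta > 0.
    """
    raw_gamma, raw_nu, raw_alpha, raw_beta = raw
    gamma = raw_gamma                         # predicted value (e.g. a timestamp)
    nu = softplus(raw_nu) + EPS               # nu > 0
    alpha = softplus(raw_alpha) + 1.0 + EPS   # alpha > 1
    beta = softplus(raw_beta) + EPS           # beta > 0
    return gamma, nu, alpha, beta


def uncertainties(nu: float, alpha: float, beta: float):
    """Return (aleatoric, epistemic) uncertainty under the NIG posterior."""
    aleatoric = beta / (alpha - 1.0)          # E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))   # Var[mu]
    return aleatoric, epistemic


def vanilla_der_regularizer(y: float, gamma: float, nu: float, alpha: float) -> float:
    """Vanilla DER evidence regularizer: |y - gamma| * (2 * nu + alpha)."""
    return abs(y - gamma) * (2.0 * nu + alpha)


if __name__ == "__main__":
    gamma, nu, alpha, beta = nig_parameters((0.5, 0.0, 0.0, 0.0))
    aleatoric, epistemic = uncertainties(nu, alpha, beta)
    print(gamma, aleatoric, epistemic)
    print(vanilla_der_regularizer(y=0.7, gamma=gamma, nu=nu, alpha=alpha))
```

In this formulation the epistemic term shrinks as the evidence parameter nu grows, while the aleatoric term does not depend on nu. That separation is what lets an evidential head report the two kinds of uncertainty independently.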

Paper Structure

This paper contains 35 sections, 36 equations, 19 figures, 7 tables, 1 algorithm.

Figures (19)

  • Figure 1: Motivation illustration. Epistemic uncertainty arises mainly from Knowledge Gaps and Semantic Ambiguity, shown in (a) and (b), while aleatoric uncertainty is primarily due to Subjective Annotations and Low-Level Feature Uncertainty, shown in (c) and (d). Specifically, (a) illustrates a critical knowledge gap in which the model’s training does not sufficiently cover possible real-world scenarios, leading to potential failures in understanding and responding to user inputs, while (b) highlights the challenges of semantic ambiguity in both visual and textual contexts. (c) indicates the subjective nature of dataset annotations, while (d) shows the challenges posed by variations in scene lighting, resolution, jittering, blurring, transitions, etc. Examples can be found in the appendices (labels ad_noise and cases_study).
  • Figure 2: Conceptual illustration: Conventional models give random responses to OOD queries, unfit for critical decisions. In contrast, our model reliably delivers sensible, informed answers.
  • Figure 3: Overall architecture of the proposed two-stage cross-modal alignment using SRAM. In the first stage, an untrimmed video and a masked query are encoded with a frozen encoder, and SRAM reconstructs the masked query tokens. In the second stage, SRAM performs temporal grounding on the video using the user's complete query. SRAM includes RFF blocks, an evidential head, a VTG head, and a Masked Language Model (MLM) head. The MLM head, enclosed by the dashed box, is trained only during the first stage. These components are discussed in the following sections.
  • Figure 4: Dataset bias sensitivity. (a) Joint distributions of the start and end timestamps of the ground-truth moments in the QVHighlights dataset. (b), (c), (d), and (e) show the predicted uncertainty's sensitivity to temporal biases in the dataset under different conditions.
  • Figure 5: Case i of attention map visualization.
  • ...and 14 more figures