Cross-modal Causal Relation Alignment for Video Question Grounding

Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, Liang Lin

TL;DR

Video question grounding (VideoQG) models often rely on spurious cross-modal correlations between visual content and QA signals, which harms the causal consistency between answering and grounding. The paper introduces Cross-modal Causal Relation Alignment (CRA), a framework that combines Gaussian Smoothing Grounding (GSG) for temporal interval estimation, Cross-Modal Alignment (CMA) for bidirectional multimodal representation learning, and Explicit Causal Intervention (ECI), which applies a back-door intervention on the linguistic side and a front-door intervention (through a visual mediator) on the visual side to deconfound predictions. The approach jointly optimizes faithful answering and temporally grounded evidence under weak supervision. It builds on a structural causal model over the variables $V$, $L$, $t$, $M$, and $Z$, and uses the normalized weighted geometric mean (NWGM) to fuse the intervention effects. Empirical results on NextGQA and STAR show that CRA achieves superior Acc@GQA and IoU@0.5, and ablations confirm that CMA and ECI are critical for mitigating biases and improving causal consistency. These findings highlight the practical value of integrating causal reasoning into multimodal grounding for robust VideoQG. Code is released for reproducibility.
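
For readers unfamiliar with the two interventions named above, the textbook forms of the back-door and front-door adjustment formulas are sketched here. These are the generic statements from causal inference, not necessarily the exact equations used in the paper. Reading $Z$ as the confounder on the language side and $M$ as the visual mediator is an assumption based on this summary; $a$ denotes the answer and $v'$ is a dummy variable ranging over visual inputs.

$$
P(a \mid do(L)) = \sum_{z} P(a \mid L, z)\, P(z)
$$

$$
P(a \mid do(V)) = \sum_{m} P(m \mid V) \sum_{v'} P(a \mid m, v')\, P(v')
$$

In deep models these sums are usually intractable to evaluate exactly. NWGM is a common approximation that moves the expectation inside the softmax, so the network is evaluated once on averaged features rather than once per confounder value.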

Abstract

Video question grounding (VideoQG) requires models to answer questions and simultaneously infer the relevant video segments that support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization performance and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA) to eliminate spurious correlations and improve the causal consistency between question answering and video temporal grounding. Our CRA involves three essential components: i) a Gaussian Smoothing Grounding (GSG) module that estimates the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter; ii) a Cross-Modal Alignment (CMA) module that enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features; iii) an Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving robust question reasoning. Code is available at https://github.com/WissingChen/CRA-GQA.
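
The GSG idea in the abstract (smooth noisy per-frame attention with a Gaussian filter, then read off a time interval) can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed `sigma` stands in for the adaptive filter, and the rule "keep the contiguous run around the peak that stays above `ratio` times the peak" is an assumption made only for illustration. The function names `gaussian_kernel`, `smooth`, and `estimate_interval` are hypothetical.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    if radius is None:
        radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(scores, sigma=1.0):
    """Convolve per-frame attention scores with a Gaussian kernel.

    Edges are handled by replicating the first and last score.
    """
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(scores)
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(t + k - radius, 0), n - 1)
            acc += w * scores[idx]
        out.append(acc)
    return out


def estimate_interval(scores, sigma=1.0, ratio=0.5):
    """Return (start, end) frame indices around the smoothed peak.

    Assumption (not from the paper): the interval is the contiguous run
    around the peak whose smoothed score stays >= ratio * peak score.
    """
    smoothed = smooth(scores, sigma)
    peak = max(range(len(smoothed)), key=smoothed.__getitem__)
    threshold = ratio * smoothed[peak]
    start = peak
    while start > 0 and smoothed[start - 1] >= threshold:
        start -= 1
    end = peak
    while end < len(smoothed) - 1 and smoothed[end + 1] >= threshold:
        end += 1
    return start, end


# Toy attention: an isolated spike at frame 1 and a sustained response at frames 4-6.
attention = [0.0, 0.9, 0.0, 0.1, 0.6, 0.8, 0.7, 0.1, 0.0, 0.0]
start, end = estimate_interval(attention, sigma=1.0, ratio=0.5)
```

In the toy input the largest raw score is the isolated spike at frame 1, but after smoothing the sustained response dominates, so the estimated interval excludes the spike. That is the de-noising effect the filter is meant to provide.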

Paper Structure

This paper contains 23 sections, 13 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: (a) A typical VideoQG example in which the model grounds the wrong segment yet still produces the correct answer, making the answer unfaithful. (b) The number of occurrences of different answers for questions that mention "baby" and "woman". (c) The distribution of the ratio between the video segment and the full video in the test and validation sets.
  • Figure 2: An overview of our CRA framework; the top of the figure shows the structural causal model (SCM) proposed in CRA. (a) Video and linguistic features are extracted separately. (b) A Temporal Encoder fuses temporal information, and the Linguistics Causal Intervention Module mitigates the bias in the QA feature, using the semantic structure graphs as confounders $\widetilde{L}$. (c) The Gaussian Smoothing Attention Grounding module estimates the cross-modal attention to refine the video feature; the average visual feature $\bar{V}$, the grounded visual feature $M$, and the pre-processed visual feature clusters $\widetilde{V}$ are then passed to the Explicit Causal Intervention Module in (d). Finally, the cross-entropy loss is computed for $a$, and bidirectional contrastive learning is applied to the selected positive and negative multimodal samples for CMA.
  • Figure 3: (a) The Gaussian Smoothing Grounding Module and the Multi-modal Causal Intervention Module, the latter consisting of (b) the back-door intervention module and (c) the explicit intervention module. $\widetilde{L}$ is the semantic graph constructed with Stanza (Qi et al., 2020), and $\widetilde{V}$ is constructed from all frames in the training set.
  • Figure 4: (a) The distribution of segment length for CRA, Temp[CLIP] (NG+), and the ground truth on the NextGQA dataset. (b) The distribution of segment ratio for CRA, Temp[CLIP] (NG+), and the ground truth. The hierarchical bins make the three distributions easy to compare.
  • Figure 5: Visualization examples from the NextGQA dataset (a) and the STAR dataset (b). The numbers ([start time, end time]) indicate the interval.
  • ...and 2 more figures