Cross-modal Causal Relation Alignment for Video Question Grounding
Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, Liang Lin
TL;DR
VideoQG models often rely on spurious cross-modal correlations between visual content and QA signals, which harms the causal consistency between answering and grounding. The paper introduces Cross-modal Causal Relation Alignment (CRA), a framework that combines Gaussian Smoothing Grounding (GSG) for temporal interval estimation, Cross-Modal Alignment (CMA) for bidirectional multimodal representation learning, and Explicit Causal Intervention (ECI), which deconfounds predictions with a back-door intervention on language and a front-door intervention on vision through a visual mediator. The approach jointly optimizes faithful answering and temporally grounded evidence under weak supervision, using a structural causal model over the variables $V$, $L$, $t$, $M$, and $Z$ and NWGM to fuse effects. On NextGQA and STAR, CRA achieves superior Acc@GQA and IoU@0.5, and ablations confirm that CMA and ECI are critical for mitigating biases and improving causal consistency. These findings highlight the practical value of integrating causal reasoning into multimodal grounding for robust VideoQG; the code is released for reproducibility.
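For reference, the back-door and front-door interventions mentioned above correspond to the standard adjustment formulas from causal inference. The symbols below are generic textbook notation and are an assumption here ($X$ for the input, $Y$ for the prediction, $Z$ for an observed confounder, $M$ for a mediator); they are not necessarily the paper's own definitions of $V$, $L$, $t$, $M$, and $Z$.

Back-door adjustment, which stratifies over an observed confounder:

$$P(Y \mid do(X)) = \sum_{z} P(Y \mid X, z)\, P(z)$$

Front-door adjustment, which routes the effect through a mediator when the confounder is unobserved:

$$P(Y \mid do(X)) = \sum_{m} P(m \mid X) \sum_{x'} P(Y \mid m, x')\, P(x')$$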
Abstract
Video question grounding (VideoQG) requires models to answer questions and simultaneously infer the relevant video segments that support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, which cause them to miss the dominant visual scenes that align with the intended question. Moreover, vision-language models generalize unfaithfully and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA) to eliminate spurious correlations and improve the causal consistency between question answering and video temporal grounding. CRA involves three essential components: i) a Gaussian Smoothing Grounding (GSG) module that estimates the time interval via cross-modal attention de-noised by an adaptive Gaussian filter; ii) a Cross-Modal Alignment (CMA) module that enhances weakly supervised VideoQG through bidirectional contrastive learning between estimated video segments and QA features; and iii) an Explicit Causal Intervention (ECI) module for multimodal deconfounding, which applies front-door intervention to vision and back-door intervention to language. Extensive experiments on two VideoQG datasets demonstrate the superiority of CRA in discovering visually grounded content and achieving robust question reasoning. Code is available at https://github.com/WissingChen/CRA-GQA.
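To make the idea of de-noising per-frame attention before estimating a time interval more concrete, here is a minimal NumPy sketch. It is not the released CRA implementation: the function names `gaussian_kernel` and `ground_interval`, the fixed `sigma` (the abstract describes an adaptive filter, which this sketch does not model), and the peak-relative threshold `ratio` are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Return a normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()


def ground_interval(scores, sigma: float = 2.0, ratio: float = 0.5):
    """Smooth per-frame attention scores and return a (start, end) frame span.

    Hypothetical helper: `sigma` is fixed here rather than adaptive, and the
    span is the contiguous run around the smoothed peak whose values stay
    at or above `ratio` times the peak value.
    """
    scores = np.asarray(scores, dtype=float)
    kernel = gaussian_kernel(sigma)
    if len(kernel) > len(scores):
        raise ValueError("score sequence is shorter than the Gaussian kernel")

    # mode="same" keeps one smoothed value per input frame.
    smoothed = np.convolve(scores, kernel, mode="same")

    peak = int(np.argmax(smoothed))
    threshold = ratio * smoothed[peak]

    # Expand left and right from the peak while values stay above threshold.
    start = peak
    while start > 0 and smoothed[start - 1] >= threshold:
        start -= 1
    end = peak
    while end < len(smoothed) - 1 and smoothed[end + 1] >= threshold:
        end += 1
    return start, end


if __name__ == "__main__":
    # One isolated noisy spike (frame 3) and one sustained segment (15 to 20).
    attention = np.zeros(32)
    attention[3] = 1.0
    attention[15:21] = 0.8
    print(ground_interval(attention))
```

In this toy input, the raw maximum is the isolated spike at frame 3, but smoothing spreads that spike out while the sustained segment keeps most of its mass, so the selected span covers the sustained segment instead. That is the intuition behind filtering attention before grounding; how CRA adapts the filter is described in the paper itself.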
