Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives
Ji-jun Park, Soo-joon Choi
TL;DR
The paper addresses the difficulty that large vision-language models (LVLMs) have in capturing causal and temporal dynamics in video narratives. It introduces a Causal-Temporal Reasoning Module (CTRM), composed of a Causal Dynamics Encoder (CDE) and a Temporal Relational Learner (TRL), that is integrated into LVLM backbones, and it adopts a three-stage learning framework: pre-training, fine-tuning with causal-temporal supervision, and contrastive alignment. Key contributions include the CTRM design, explicit causal and temporal losses, and a dataset-agnostic approach that achieves state-of-the-art results on MSVD and MSR-VTT along with improved human evaluation scores. The findings show enhanced narrative coherence and temporal consistency, suggesting practical value for real-world video understanding and captioning, with possible extensions to multilingual and domain-specific settings.
Abstract
Video captioning, a core task in multimodal machine learning, aims to generate descriptive and coherent textual narratives for video content. While large vision-language models (LVLMs) have shown significant progress, they often struggle to capture the causal and temporal dynamics inherent in complex video sequences. To address this limitation, we propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs. CTRM comprises two key components, the Causal Dynamics Encoder (CDE) and the Temporal Relational Learner (TRL), which together encode causal dependencies and temporal consistency from video frames. We further design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets, fine-tuning on causally annotated data, and contrastive alignment for better embedding coherence. Experimental results on the standard benchmarks MSVD and MSR-VTT show that our method outperforms existing approaches on both automatic metrics (CIDEr, BLEU-4, ROUGE-L) and human evaluations, producing more fluent, coherent, and relevant captions. These results validate the effectiveness of our approach in generating captions with enriched causal-temporal narratives.
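
The contrastive alignment stage mentioned above can be illustrated with a minimal sketch. The abstract does not state the exact objective, so the symmetric InfoNCE-style loss below is an assumption, as are the function name `contrastive_alignment_loss`, the `temperature` default, and the convention that the i-th video embedding is paired with the i-th caption embedding.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    # Hypothetical symmetric InfoNCE loss over paired video/caption embeddings.
    # Matching pairs share an index; all other pairs in the batch are negatives.
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_alignment_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = contrastive_alignment_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy example the loss is lower when each caption embedding points in the same direction as its paired video embedding than when the pairs are swapped, which is the behavior a contrastive alignment stage is meant to encourage.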
