
Structured Causal Video Reasoning via Multi-Objective Alignment

Zinuo Li, Yongxin Guo, Jun Liu, Jiawei Zhan, Xi Jiang, Chengjie Wang, Mohammed Bennamoun, Farid Boussaid, Feng Zheng, Qiuhong Ke

Abstract

Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient reasoning and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint that promotes concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During the RL stage, we find that this framework introduces competing objectives, as structural completeness and causal fidelity must be balanced against reasoning length, making optimization difficult. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto frontier to balance these trade-offs. The result is Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks that require fine-grained temporal inference.
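To make the competing-objectives framing concrete: once each rollout is scored along the three axes named above, some rollouts Pareto-dominate others, and only the non-dominated ones represent genuine trade-offs. The following is a minimal illustrative sketch of that dominance structure; the objective names, the brevity scoring, and `max_len` are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Hypothetical per-rollout scores along the three competing axes named in
# the abstract: structural completeness, causal fidelity, and brevity
# (reasoning length, negated so that higher is better).
def reward_vector(completeness: float, fidelity: float,
                  length: int, max_len: int = 1024) -> np.ndarray:
    return np.array([completeness, fidelity, 1.0 - length / max_len])

def dominates(a: np.ndarray, b: np.ndarray) -> bool:
    # a Pareto-dominates b if it is no worse on every objective
    # and strictly better on at least one.
    return bool(np.all(a >= b) and np.any(a > b))

def pareto_front(points: list[np.ndarray]) -> list[int]:
    # Indices of rollouts that no other rollout dominates.
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]
```

A verbose rollout that is no more complete or causally faithful than a shorter one is dominated; optimizing "toward the Pareto frontier" means steering credit toward the non-dominated set rather than any single objective.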

Figures (14)

  • Figure 1: Compared with existing video reasoning approaches, our model first extracts factual event information from videos. It then applies a thinking process strictly constrained by causal relationships, which is specifically optimized for video data. This method clarifies critical information and enhances interpretability while focusing on the temporal dimension of videos. Video is sampled from ActivityNet-Captions [krishna2017dense].
  • Figure 2: The complete reasoning pipeline of our model. Given a video, we first establish event-level factual information. This step highlights all critical clues, such as time, people, and actions. These clues constrain the subsequent thinking process, enabling the model to reason logically from evidence while focusing on temporal causal relationships. Video is sampled from ActivityNet-Captions [krishna2017dense].
  • Figure 3: Overview of the two-stage pipeline for constructing CausalFact-60K from VTG datasets. Stage 1 performs video filtering and gap filling, generates structured facts captions, and applies an automatic quality judge with random human inspection; low-quality samples are rejected and iteratively refined. Stage 2 produces causally grounded reasoning traces, followed by a second quality-judging and human spot-checking step, yielding the final curated causal–facts set.
  • Figure 4: GRPO vs. P-FAB advantage comparison. P-FAB dynamically adjusts weights by solving a minimum-norm problem in the standardized reward space, ensuring that rare but critical signals are not overwhelmed by high-variance conflicting objectives (a hypothetical sketch of such a min-norm solve follows this figure list).
  • Figure 5: Distribution of Video Sources. Our dataset comprises 32,049 videos selected from high-quality VTG benchmarks. We utilize the precise human-annotated timestamps from these sources while regenerating the textual content to align with our structured event schema.
  • ...and 9 more figures
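The caption of Figure 4 says P-FAB sets per-objective weights by solving a minimum-norm problem in the standardized reward space. This excerpt does not spell out that solve, so the sketch below shows one standard way such a problem can be posed and solved: projected gradient descent over the probability simplex, in the spirit of MGDA-style min-norm solvers. All function names and hyperparameters here are hypothetical, not the authors' implementation.

```python
import numpy as np

def standardize(rewards: np.ndarray) -> np.ndarray:
    """Z-score each objective column; rewards is (num_rollouts, num_objectives)."""
    mu = rewards.mean(axis=0, keepdims=True)
    sd = rewards.std(axis=0, keepdims=True) + 1e-8
    return (rewards - mu) / sd

def project_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def min_norm_weights(Z: np.ndarray, iters: int = 500) -> np.ndarray:
    """Approximately solve min_w ||Z w||^2 over the probability simplex
    via projected gradient descent (an MGDA-style min-norm solve).

    Objectives whose standardized rewards conflict get balanced rather
    than letting the highest-variance one dominate the combined signal.
    """
    n, k = Z.shape
    gram = (Z.T @ Z) / n                               # scaling keeps the argmin
    step = 1.0 / (2.0 * np.linalg.norm(gram, 2) + 1e-8)  # 1/Lipschitz step size
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        w = project_simplex(w - step * 2.0 * gram @ w)
    return w

def combined_advantage(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style scalar advantage per rollout under the solved weights."""
    Z = standardize(rewards)
    return Z @ min_norm_weights(Z)
```

For example, `combined_advantage(np.stack([r_completeness, r_fidelity, r_brevity], axis=1))` would return one scalar advantage per rollout, with conflicting high-variance objectives damped rather than allowed to drown out rarer signals.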