Process-of-Thought Reasoning for Videos

Jusheng Zhang, Kaitong Cai, Jian Wang, Yongsen Zheng, Kwok-Yan Lam, Keze Wang

TL;DR

This work tackles a gap in video understanding: models describe content but fail to reason about temporal causality. It introduces Process-of-Thought (PoT) Reasoning for Videos, a neuro-symbolic framework that grounds videos into discrete events, builds symbolic reasoning chains via a Discrete CoT Generator, and verifies these chains with a Hybrid Differentiable Verifier. The training objective combines caption quality with four components (predictive utility, temporal logic, counterfactual robustness, and sparsity), which make the discrete reasoning chain differentiable so that gradients can propagate back through perception. Empirically, the resulting system, LogicAgent, achieves state-of-the-art performance across six video-language benchmarks, shows strong data efficiency in few-shot settings, and hallucinates less, which is attributed to verifiable reasoning. The approach offers a scalable, interpretable path to robust, causally grounded video understanding, with practical implications for safety-critical AI systems.
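
Read as a formula, the training objective described above might take a weighted-sum form. This is only a sketch: the term names $\mathcal{L}_{\text{cap}}$, $\mathcal{L}_{\text{pred}}$, $\mathcal{L}_{\text{temp}}$, $\mathcal{L}_{\text{cf}}$, $\mathcal{L}_{\text{sparse}}$ and the weights $\lambda_1, \dots, \lambda_4 \ge 0$ are assumed notation for illustration, and the summary does not state how the paper actually combines the terms.

$$
\mathcal{L} \;=\; \mathcal{L}_{\text{cap}} \;+\; \lambda_1\,\mathcal{L}_{\text{pred}} \;+\; \lambda_2\,\mathcal{L}_{\text{temp}} \;+\; \lambda_3\,\mathcal{L}_{\text{cf}} \;+\; \lambda_4\,\mathcal{L}_{\text{sparse}}
$$

Under this reading, $\mathcal{L}_{\text{cap}}$ measures caption quality, while the four weighted terms correspond to predictive utility, temporal logic, counterfactual robustness, and sparsity respectively.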

Abstract

Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
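
The three interleaved stages in the abstract can be illustrated with a toy loop. The sketch below is not the authors' implementation: every name in it (`Segment`, `TraceStep`, `select_evidence`, `update_state`, `synthesize_answer`) is hypothetical, text captions stand in for visual features, and word overlap stands in for a learned relevance model. It only shows how a trace can stay aligned with temporal segments while the answer is restricted to a fixed candidate set.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    caption: str   # stand-in for the segment's visual features (assumption)


@dataclass
class TraceStep:
    segment: Segment  # the temporal evidence this step relied on
    note: str         # human-readable record of the decision


@dataclass
class State:
    evidence: list = field(default_factory=list)  # captions accepted so far
    trace: list = field(default_factory=list)     # TraceStep records


def select_evidence(question, segments, k=2):
    """(i) Temporal evidence selection: keep the k segments sharing most words with the question."""
    q = set(question.lower().split())
    ranked = sorted(segments, key=lambda s: -len(q & set(s.caption.lower().split())))
    # Restore temporal order so later steps see events in sequence.
    return sorted(ranked[:k], key=lambda s: s.start)


def update_state(state, segment):
    """(ii) Step-wise state update: fold one segment into the state and log it."""
    state.evidence.append(segment.caption)
    state.trace.append(TraceStep(segment, f"used [{segment.start:.0f}s-{segment.end:.0f}s]"))
    return state


def synthesize_answer(state, candidates):
    """(iii) Constrained answer synthesis: choose only among the allowed candidates."""
    words = set(" ".join(state.evidence).lower().split())
    return max(candidates, key=lambda c: len(words & set(c.lower().split())))


def answer(question, segments, candidates):
    state = State()
    for seg in select_evidence(question, segments):
        state = update_state(state, seg)
    return synthesize_answer(state, candidates), state.trace


if __name__ == "__main__":
    video = [
        Segment(0, 5, "a man opens the fridge"),
        Segment(5, 10, "a dog sleeps on the sofa"),
        Segment(10, 15, "the man pours milk into a glass"),
    ]
    result, trace = answer("what does the man pour", video, ["milk", "water", "juice"])
    print(result, [step.note for step in trace])
```

Because each `TraceStep` stores the segment it used, any answer can be traced back to specific time spans, which is the property the abstract credits for robustness to distractors.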

Paper Structure (73 sections, 7 theorems, 30 equations, 7 figures, 14 tables)

Key Result

Proposition 2.3

Under the $\delta$-discriminative data assumption and the hybrid objective $\mathcal{L}$, gradient flow guarantees that for any distinct $z_i, z_j \in Z$, the expected inner product of their optimization directions satisfies a bound that implies divergent trajectories, ensuring $\|z_i - z_j\|_2 > 0$ at convergence.

Figures (7)

  • Figure 1: The LogicAgent Pipeline. Unlike black-box models, we explicitly model the narrative process. (1) Eventifier: Lifts continuous features $V$ into grounded events $\mathcal{E}$; (2) CoT Generator: Synthesizes symbolic causal chains $\mathcal{C}$; (3) Hybrid Verifier: Scores chains to guide learning. This verifiable structure prevents gradient isolation and ensures logical consistency.
  • Figure 2: Reasoning Dynamics. (a) Update: $r_t$ boosts causal belief (C1) vs. errors (C2). (b) Kernel: $K_t$ transfers semantics but blocks counterfactuals. (c) Convergence: Temporal ops learn faster than semantic ones. (d) Matrix: Shows logical disentanglement.
  • Figure 3: Functional Verification. We transform the discrete chain into a differentiable variable via four objectives: (1) Predictive Utility, (2) Logical Consistency, (3) Counterfactual Robustness, and (4) Sparsity, ensuring causal structure learning.
  • Figure 4: POPE results. LogicAgent achieves consistent gains across Random, Popular, and Adversarial settings. The significant improvement in the Adversarial split highlights the robustness of our verifier against misleading visual cues.
  • Figure 5: Hallucination evaluation on MME. Large improvements in Commonsense Reasoning and Count metrics verify that LogicAgent's structured reasoning effectively grounds generation in reality.
  • ...and 2 more figures
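
To make the four objectives named in the Figure 3 caption concrete, here is a minimal sketch of how a verifier might score one candidate chain. The captions do not give the actual formulas, so everything here is an assumption: the function names `before_penalty` and `chain_loss`, the soft "before" relation on timestamps, the per-event gates, and the default weights are all hypothetical. The predictive and counterfactual terms are passed in as numbers because their definitions are not available in this summary.

```python
import math


def before_penalty(t_a, t_b, temperature=1.0):
    """Smooth penalty for violating 'a happens before b'.

    Equals -log(sigmoid((t_b - t_a) / temperature)), written in a
    numerically stable form. It is close to 0 when a precedes b and
    grows roughly linearly when the order is reversed.
    """
    x = (t_b - t_a) / temperature
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def chain_loss(times, gates, pred_loss, cf_loss, weights=(1.0, 1.0, 1.0, 0.1)):
    """Combine four verifier terms for one candidate chain (illustrative only).

    times:     event timestamps, listed in chain order
    gates:     soft inclusion probabilities in [0, 1], one per event
    pred_loss: downstream prediction loss given the chain (supplied by caller)
    cf_loss:   counterfactual-robustness loss (supplied by caller)
    """
    w_pred, w_logic, w_cf, w_sparse = weights
    # Logical consistency: consecutive events in the chain should be time-ordered.
    logic = sum(before_penalty(a, b) for a, b in zip(times, times[1:]))
    # Sparsity: expected number of events kept in the chain.
    sparse = sum(gates)
    return w_pred * pred_loss + w_logic * logic + w_cf * cf_loss + w_sparse * sparse


if __name__ == "__main__":
    ordered = chain_loss([1.0, 5.0, 9.0], [1.0, 1.0, 1.0], pred_loss=0.5, cf_loss=0.2)
    reversed_chain = chain_loss([9.0, 5.0, 1.0], [1.0, 1.0, 1.0], pred_loss=0.5, cf_loss=0.2)
    print(ordered < reversed_chain)
```

Every term is a smooth function of timestamps and gate probabilities, which is one way a discrete chain could be relaxed into a differentiable variable; in a real system the same structure would be written with an autodiff library rather than plain floats.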

Theorems & Definitions (15)

  • Definition 2.1: Operator Collapse
  • Definition 2.2: $\delta$-Discriminative Data Distribution
  • Proposition 2.3: Global Identifiability of Logic Operators
  • Lemma 2.4: Orthogonality of Temporal Constraints
  • proof
  • Lemma 2.5: Semantic Separation via Counterfactual Barrier
  • proof
  • Proposition 3.1: Identifiability up to Isometry
  • proof
  • Lemma 3.2: Conditioning Dependency
  • ...and 5 more