Process-of-Thought Reasoning for Videos
Jusheng Zhang, Kaitong Cai, Jian Wang, Yongsen Zheng, Kwok-Yan Lam, Keze Wang
TL;DR
This work tackles a gap in video understanding: models can describe content but fail to reason about temporal causality. It introduces Process-of-Thought (PoT) Reasoning for Videos, a neuro-symbolic framework that grounds videos into discrete events, builds symbolic reasoning chains with a Discrete CoT Generator, and verifies those chains with a Hybrid Differentiable Verifier. The training objective combines caption quality with four components (predictive utility, temporal logic, counterfactual robustness, and sparsity), which makes the reasoning chain differentiable so that gradients can be backpropagated through perception. Empirically, LogicAgent achieves state-of-the-art performance across six video-language benchmarks, is data-efficient in few-shot settings, and hallucinates less because its reasoning is verifiable. The approach offers a scalable, interpretable path to robust, causally grounded video understanding, with practical implications for safety-critical AI systems.
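One way to picture the combined objective described above is as a caption loss plus a weighted sum of the four named components. The sketch below is illustrative only: the summary does not give the actual formula, so the additive form, the weight values, the field names, and the convention that every term is already expressed as a loss (lower is better) are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LossTerms:
    """Per-batch scalar losses; names are hypothetical labels for the five terms."""
    caption: float
    predictive_utility: float
    temporal_logic: float
    counterfactual: float
    sparsity: float


# Assumed default weights, chosen only for illustration.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "predictive_utility": 1.0,
    "temporal_logic": 1.0,
    "counterfactual": 1.0,
    "sparsity": 0.1,
}


def total_loss(terms: LossTerms, weights: Optional[Dict[str, float]] = None) -> float:
    """Caption loss plus a weighted sum of the four auxiliary terms (assumed form)."""
    w = dict(DEFAULT_WEIGHTS)
    if weights:
        w.update(weights)  # caller may override any subset of the weights
    return terms.caption + sum(w[name] * getattr(terms, name) for name in w)
```

In a real training loop each term would be a differentiable tensor rather than a float; plain floats are used here only to keep the sketch dependency-free.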
Abstract
Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
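The three interleaved stages named in the abstract, and the idea of a trace that ties each intermediate decision to a temporal segment, can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `Segment`, `Step`, `Trace`, `select_evidence`, and `run_pot` are hypothetical names, per-segment captions stand in for backbone features, and word overlap stands in for learned scoring.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    caption: str   # stand-in for whatever a vision-language backbone extracts


@dataclass
class Step:
    span: Tuple[float, float]  # temporal segment this decision is grounded in
    note: str                  # the intermediate decision or observation


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


def _words(text: str) -> Set[str]:
    return set(text.lower().split())


def select_evidence(segments: List[Segment], question: str, top_k: int = 2) -> List[Segment]:
    """(i) Keep the top_k segments by word overlap with the question, in temporal order."""
    q = _words(question)
    ranked = sorted(segments, key=lambda s: len(q & _words(s.caption)), reverse=True)
    return sorted(ranked[:top_k], key=lambda s: s.start)


def run_pot(segments: List[Segment], question: str, candidates: List[str], top_k: int = 2) -> Trace:
    trace = Trace()
    state: Set[str] = set()
    for seg in select_evidence(segments, question, top_k):
        state |= _words(seg.caption)  # (ii) step-wise state update
        trace.steps.append(Step((seg.start, seg.end), seg.caption))
    # (iii) constrained synthesis: the answer must be one of the given candidates
    trace.answer = max(candidates, key=lambda c: len(state & _words(c)))
    return trace
```

Because every `Step` stores the span it came from, a wrong answer can be traced back to the segment that misled the state, which is the kind of diagnosis the abstract attributes to PoT traces.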
