GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking

Zixu Cheng, Da Li, Jian Hu, Ziquan Liu, Wei Li, Shaogang Gong

TL;DR

GraphThinker tackles hallucinations in video reasoning by introducing Event-based Video Scene Graphs (EVSG) that encode intra-event and inter-event relations, constructed end-to-end by an MLLM. It then uses GRPO-based reinforcement finetuning with a visual attention reward to promote visual grounding and adherence to the EVSG-guided reasoning. The method achieves state-of-the-art performance on RexTime and VidHalluc, with substantial improvements in both temporal localization and reduction of hallucinations. This work demonstrates the value of explicit event-level structure and targeted rewards for robust, grounded video reasoning in multimodal models.

Abstract

Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. Existing multimodal large language models (MLLMs) often infer event relations from dense captions or video summaries, but such modeling still lacks causal understanding. Without explicit modeling of causal structure within and across video events, these models hallucinate during video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it captures object and event relations better and localizes events more precisely than prior methods, reducing hallucinations in video reasoning.
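The TL;DR describes the training as GRPO-based reinforcement finetuning with an added visual attention reward. The sketch below illustrates only the group-relative advantage step that GRPO is known for: each sampled response's reward is normalized against the other responses in its group. It is an illustration under stated assumptions, not the authors' implementation. The function names, the additive reward combination, the `weight` parameter, and the example numbers are all hypothetical, and the visual attention reward is treated as an opaque scalar because this summary does not define it.

```python
import statistics
from typing import List


def total_reward(task_reward: float, attention_reward: float, weight: float = 0.5) -> float:
    """Combine a task reward with a visual attention reward.

    Assumption: a simple weighted sum. The source does not specify how
    the rewards are combined.
    """
    return task_reward + weight * attention_reward


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is used here; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled responses to the same video question.
task_rewards = [1.0, 0.0, 1.0, 0.0]
attention_rewards = [0.8, 0.6, 0.2, 0.4]

rewards = [total_reward(t, a) for t, a in zip(task_rewards, attention_rewards)]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive positive advantages and are reinforced; those below receive negative advantages. Under this hypothetical combination, a correct answer with stronger visual grounding is favored over a correct answer with weaker grounding.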

Paper Structure (15 sections, 12 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Current MLLMs [bai2025qwen2] model event relations in video only implicitly, through token correlations, which often leads to hallucinations in video reasoning. For instance, when determining the temporal order of events, QwenVL-2.5 tends to produce temporal hallucinations because it lacks explicit event-level evidence validation and reasoning. In contrast, GraphThinker explicitly models both intra- and inter-event relations, guided by selective visual attention over the video. Reasoning is thereby constrained by structured event relations anchored in fine-grained visual evidence, which reduces hallucinations and improves temporal consistency.
  • Figure 2: An overview of the GraphThinker video reasoning model. GraphThinker first employs an MLLM to generate multi-grained dense captions in sequence for a video. It then uses the same MLLM to select keywords as graph nodes and iteratively refines them to construct an event-based video scene graph (EVSG). The EVSG serves as a fine-grained representation of structured event relations for reasoning. Given the EVSG, we further develop an Event Graph-based RL post-training method whose visual attention reward, conditioned on the EVSG, encourages the model to attend to more relevant visual evidence. Together, these components give GraphThinker more visually grounded and temporally consistent video reasoning.
  • Figure 3: An example of the proposed Event-based Video Scene Graph (EVSG) for video reasoning. The EVSG is composed of event subgraphs derived from event-level captions. Each subgraph has start–end timestamps and a set of triplets that represent object interactions and spatial relationships, capturing intra-event semantics. Event subgraphs are sequentially linked by timestamp-based edges, forming a hierarchical structure that explicitly models both intra-event and inter-event relations for temporally consistent reasoning.
  • Figure 4: A visual example showing that our method reduces hallucination during reasoning. QwenVL-2.5 still hallucinates when reasoning at inference time. Our method explicitly models event relations to guide the MLLM's reasoning process, producing visually grounded and temporally consistent reasoning.
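
The EVSG layout described in the Figure 3 caption (event subgraphs with start and end timestamps, intra-event triplets, and timestamp-based links between events) can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class names `Triplet`, `EventSubgraph`, and `EVSG`, their fields, and the text serialization format are assumptions, and the example events and timestamps are invented.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One intra-event relation: (subject, relation, object)."""
    subject: str
    relation: str
    obj: str


@dataclass
class EventSubgraph:
    """One event: start/end timestamps in seconds plus its triplets."""
    start: float
    end: float
    triplets: List[Triplet] = field(default_factory=list)


@dataclass
class EVSG:
    """Event subgraphs linked by timestamp-based inter-event edges."""
    events: List[EventSubgraph] = field(default_factory=list)

    def _time_order(self) -> List[int]:
        # Indices of events sorted by start time.
        return sorted(range(len(self.events)), key=lambda i: self.events[i].start)

    def temporal_edges(self) -> List[Tuple[int, int]]:
        """Link each event to the next one in start-time order."""
        order = self._time_order()
        return list(zip(order, order[1:]))

    def to_text(self) -> str:
        """Serialize in temporal order (hypothetical format)."""
        lines = []
        for i in self._time_order():
            ev = self.events[i]
            rels = "; ".join(f"({t.subject}, {t.relation}, {t.obj})" for t in ev.triplets)
            lines.append(f"[{ev.start:.1f}s-{ev.end:.1f}s] {rels}")
        return "\n".join(lines)


# Invented example: two events given out of temporal order.
graph = EVSG(events=[
    EventSubgraph(5.0, 9.5, [Triplet("person", "pours", "water")]),
    EventSubgraph(0.0, 4.5, [Triplet("person", "picks up", "kettle"),
                             Triplet("kettle", "on", "stove")]),
])
print(graph.temporal_edges())
print(graph.to_text())
```

Keeping triplets inside each event and deriving inter-event edges from timestamps mirrors the two-level structure in the caption. A serialized form like `to_text()` is one plausible way such a graph could be placed into a model's intermediate reasoning text.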