Video-CoE: Reinforcing Video Event Prediction via Chain of Events

Qile Su; Jing Tang; Rui Chen; Lei Sun; Xiangxiang Chu

Abstract

Despite advances in applying MLLMs to various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and to establish logical relationships between videos and future events, both of which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions: a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains that implicitly push the MLLM to focus on the visual content and on the logical connections between videos and future events, and which strengthens the model's reasoning capability through multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state of the art on the VEP task. Code and models will be released soon.

Paper Structure (26 sections, 10 equations, 18 figures, 6 tables)

Introduction
Related Works
Video Event Prediction
Visual Large Language Models for Reasoning
Evaluation and Analysis of MLLMs on VEP
Method
Chain of Events (CoE) Paradigm
CoE with Supervised Fine-Tuning
CoE with Group Relative Policy Optimization
Training
Experiments
Setup
Main Results
Ablation Study
Training Curves
...and 11 more sections

Figures (18)

Figure 1: Analysis of open-source MLLMs on video event prediction tasks. The first subfigure illustrates the reasoning process, indicating the lack of logical reasoning capabilities in the VEP task. The second subfigure illustrates the attention distribution of the option tokens over input tokens, demonstrating the insufficient utilization of visual information.
Figure 2: An illustration of our proposed CoE-SFT method, with Qwen2.5-VL-72B as the larger model. We provide the larger model with the video and the future event, and prompt it to generate the intermediate logical reasoning process that connects them. Training on such data encourages the model to develop logical reasoning abilities rather than relying on option-based analysis.
Figure 3: An illustration of our proposed CoE-GRPO method. The overall supervision signal consists of three components: $r_e$ encourages the model to follow the CoE reasoning paradigm and constrains the CoE length; $r_s$ supervises the alignment between event timestamps and textual descriptions while preventing reward hacking; and $r_a$ provides verifiable reward signals. The scissor icon indicates the temporal segmentation of video clips based on timestamps.
Figure 4: Attention difference of visual tokens between different methods and the base model. Portions greater than 0 indicate an improvement in attention.
Figure 5: The training curves of CoE-GRPO.
...and 13 more figures
