Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

Songyuan Yang, Weijiang Yu, Jilin Ma, Ziyu Liu, Guijian Tang, Wenjing Yang, Huibin Tan, Nong Xiao

Abstract

Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and three novel task-driven rewards: a Frame-sensitive reward grounds reasoning on explicit key frames, a Think-transparency reward shapes readable and parsable reasoning traces, and an Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate its reasoning capabilities. In RLER-Inference, we apply a training-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on eight representative benchmarks. RLER achieves state-of-the-art results across all benchmarks and delivers an average improvement of 6.3% over base models while using, on average, 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.
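To make the inference side concrete, below is a minimal Python sketch of an evidence-weighted election over the scoring terms the abstract names (evidence consistency, confidence, transparency, non-redundancy). All names (Candidate, score, elect), the Jaccard-overlap consistency measure, and the equal weighting are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of RLER-Inference's evidence-weighted election.
# Names and the equal weighting are assumptions, not the authors' code.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Candidate:
    answer: str              # parsed final answer, e.g. "B"
    cited_frames: set[int]   # key-frame indices cited in the reasoning trace
    confidence: float        # model-reported or logprob-derived, in [0, 1]
    transparency: float      # how cleanly the trace parsed, in [0, 1]
    redundancy: float        # repetition ratio of the trace, in [0, 1]

def consistency(c: Candidate, pool: list[Candidate]) -> float:
    """Mean Jaccard overlap between c's cited frames and the other candidates'."""
    others = [o for o in pool if o is not c]
    if not others or not c.cited_frames:
        return 0.0
    overlaps = [len(c.cited_frames & o.cited_frames) /
                len(c.cited_frames | o.cited_frames) for o in others]
    return sum(overlaps) / len(overlaps)

def score(c: Candidate, pool: list[Candidate]) -> float:
    # Equal weights are an assumption; the paper likely tunes these.
    return (consistency(c, pool) + c.confidence +
            c.transparency + (1.0 - c.redundancy)) / 4.0

def elect(pool: list[Candidate]) -> str:
    """Evidence-weighted vote: accumulate scores per distinct answer."""
    votes: dict[str, float] = defaultdict(float)
    for c in pool:
        votes[c.answer] += score(c, pool)
    return max(votes, key=votes.get)
```

Scoring each candidate against the whole pool, rather than counting raw votes, is what lets a well-evidenced minority answer outweigh a repetitive majority.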


Paper Structure

This paper contains 13 sections, 19 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Comparison between Traditional and RLER Inference. Single-pass inference outputs an answer without verification and may miss key evidence. RLER produces structured outputs, scores evidence across candidates, aggregates by evidence, and performs a refutation check to deliver credible answers.
  • Figure 2: Method Overview. RLER-Training uses GRPO with Frame-sensitive, Think-transparency, and Anti-repetition Rewards to teach the model to emit structured outputs with keyframe citations and potentiate its reasoning capabilities. RLER-Inference uses diverse inputs to produce multiple candidates, parses their structure, scores evidence, aggregates by evidence weights, and runs a refutation check. (A hedged sketch of the reward combination follows this list.)
  • Figure 3: A case study showing how RLER uses diverse inputs to form structured candidates, scores evidence, aggregates robustly, and performs refutation verification to revise the initial answer and deliver the final result.
  • Figure 4: An example of emergent behavior elicited by RLER-Training. The red text marks the "aha moment", where the model identifies an internal conflict and initiates a re-evaluation.
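Complementing the inference sketch above, here is a minimal sketch of how the three RLER-Training rewards named in the Figure 2 caption might be combined into a single scalar for GRPO. The parsing conventions (e.g. "[frame 12]" citations, <think> tags) and the weights are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the three RLER-Training reward terms.
# Citation format, tag layout, and weights are illustrative assumptions.
import re

def frame_sensitive_reward(trace: str, gold_frames: set[int]) -> float:
    """Jaccard overlap between cited frames and the known key frames."""
    cited = {int(m) for m in re.findall(r"\[frame\s+(\d+)\]", trace)}
    if not cited:
        return 0.0
    return len(cited & gold_frames) / len(cited | gold_frames)

def think_transparency_reward(trace: str) -> float:
    """1.0 if the trace follows a parsable <think>...</think> layout."""
    return 1.0 if re.search(r"<think>.*</think>", trace, re.S) else 0.0

def anti_repetition_reward(trace: str, n: int = 4) -> float:
    """Information density proxy: fraction of distinct n-grams in the trace."""
    toks = trace.split()
    if len(toks) < n:
        return 1.0
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams)

def total_reward(trace: str, gold_frames: set[int],
                 w: tuple = (1.0, 0.5, 0.5)) -> float:
    # Weights w are illustrative; GRPO then normalizes the scalar rewards
    # within each group of sampled rollouts to form the advantage.
    return (w[0] * frame_sensitive_reward(trace, gold_frames)
            + w[1] * think_transparency_reward(trace)
            + w[2] * anti_repetition_reward(trace))
```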