Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu

TL;DR

This work introduces SEED-Bench-R1, a video-understanding benchmark with a three-level generalization hierarchy designed to test perception and reasoning in multimodal LLMs. Using Qwen2-VL-Instruct-7B as the base model, it compares reinforcement learning (RL) via GRPO against supervised fine-tuning (SFT), showing that RL is more data-efficient and generalizes better, including to LongVideoBench. The analysis reveals that RL improves visual perception and dynamic querying through Chain of Thought (COT) generation, but can yield less coherent reasoning chains and occasionally misses key visual cues because of perceptual limits. The authors outline future directions: strengthening base-model reasoning, improving reward modeling, and making RL more robust to noisy signals for scalable multimodal alignment.

Abstract

Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

Paper Structure

This paper contains 14 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (a) Overview of SEED-Bench-R1 (SBR), which systematically evaluates post-training methods for MLLMs in video understanding. SBR features a three-level evaluation hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with training data containing easily verifiable ground-truth answers, to assess generalization across different levels. These tasks necessitate both perception and reasoning to tackle complex real-world challenges. (b) Notably, an MLLM trained using reinforcement learning via GRPO outperforms both the base model and supervised fine-tuning (SFT) model, particularly in out-of-distribution scenarios (e.g., levels 2–3). Additionally, this RL-trained model exhibits strong generalization capabilities across general video understanding benchmarks (e.g., LongVideoBench).
  • Figure 2: Example questions from the three-level evaluation hierarchy in SEED-Bench-R1’s validation set, including in-distribution, cross-environment, and cross-environment-task scenarios.
  • Figure 3: Completion length and accuracy reward over RL training steps. While the reward generally increases during RL, the MLLM's completion length does not increase significantly.
  • Figure 4: Comparison of model responses to a Level-1 question from SEED-Bench-R1. The visual input includes 16 sampled frames from a video (showing task progress) and a final observation image. Attention maps (output-to-visual tokens) are shown for each model: the base (Qwen2-VL-7B), GRPO fine-tuned, and SFT fine-tuned versions. The base and SFT models exhibit illusory perceptions (red text), while the GRPO model attends more accurately to the relevant visual regions, e.g., correctly identifying cream cheese in the pot (green box) and suggesting the next step (discarding the empty yogurt container). The SFT model's attention is ineffective (red box), and the base model's attention is dispersed, impairing its judgment.
  • Figure 5: Comparison of model responses to a Level-2 (out-of-distribution, cross-environment) question from SEED-Bench-R1. The GRPO fine-tuned model attends more accurately to hand movement (highlighted in the green box). Interestingly, although the GRPO fine-tuned model produces incorrect reasoning steps similar to those of the SFT-trained model (red text), it ultimately outputs the correct answer by disregarding the flawed semantic reasoning. This suggests that GRPO, with its outcome-supervised reward signal, primarily enhances visual perception but may compromise the logical coherence and semantic accuracy of the model's reasoning process.
  • ...and 3 more figures
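
The captions above refer to an accuracy reward and to GRPO's outcome-supervised reward signal. As a rough illustration only, and not the authors' implementation, the sketch below scores multiple-choice completions against a verifiable ground-truth letter and then normalizes the rewards within a group of sampled completions, which is the group-relative advantage commonly associated with GRPO. The `<answer>` tag format, the function names, the use of the population standard deviation, and the `eps` constant are all assumptions made for this example.

```python
import re
import statistics


def accuracy_reward(completion: str, answer: str) -> float:
    """Return 1.0 if the predicted option letter matches the ground truth, else 0.0.

    Assumption: the final choice appears as <answer>X</answer>; this tag
    format is hypothetical and is not taken from the source text.
    """
    match = re.search(r"<answer>\s*([A-Z])", completion)
    return 1.0 if match and match.group(1) == answer else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same question.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation and eps=1e-6 are illustrative choices.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one question whose correct option is "B".
completions = [
    "The hand reaches for the lid. <answer>B</answer>",
    "The pot looks empty. <answer>A</answer>",
    "No clear cue. <answer>C</answer>",
    "The container is empty, so discard it. <answer>B</answer>",
]
rewards = [accuracy_reward(c, "B") for c in completions]
advantages = group_relative_advantages(rewards)
```

Because the reward depends only on the final answer, a completion with flawed intermediate reasoning but a correct option still receives a positive advantage, which is consistent with the behavior described in the Figure 5 caption.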