EVA: Efficient Reinforcement Learning for End-to-End Video Agent

Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu

Abstract

Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline - comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Group Relative Policy Optimization (GRPO) - that bridges supervised imitation and reinforcement learning. We further construct high-quality datasets for each stage, supporting stable and reproducible training. We evaluate EVA on six video understanding benchmarks, demonstrating its comprehensive capabilities. Compared with existing baselines, EVA achieves a substantial improvement of 6-12% over general MLLM baselines and a further 1-3% gain over prior adaptive agent methods.
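The final stage of the pipeline named above, GRPO, scores each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that group-relative normalization only, not the authors' training code. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per rollout sampled for the same prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts answering the same video question:
# 1.0 for a correct answer, 0.0 for an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.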

Paper Structure (40 sections, 6 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Given a question that requires the MLLM to work out action sequences from an extremely long video (over 6600 seconds), the traditional uniform sampling method is limited by the context length of the MLLM, and it is extremely hard for it to sample all the keyframes needed to answer the question correctly. In the traditional agentic method, the agent is also given the uniformly sampled frames along with the video, which already occupy a lot of context. Although the agent can call tools to extract frames from a specific time range, the tool is rigid and the agent cannot adjust the fps and resolution, which leads to potential information loss. In EVA, by contrast, the agent can allocate its tokens wisely. It can first watch the whole video with low resolution and high fps to get an overview of the video without costing too many visual tokens. After it finds the key time range, it extracts frames with high fps and high resolution, which leads to the correct answer.
  • Figure 2: Data Pipeline and Training Stages of EVA. The base model is first fine-tuned on a synthetic dataset with specific reasoning and tool-calling patterns. Then we use KTO to help the model learn from typical failures. Finally, we introduce a Data-Enhanced Multi-Stage GRPO training pipeline, where we collect the failure cases of the current policy and employ a teacher MLLM to generate a new open-ended video QA dataset.
  • Figure 3: Distribution of the Training Dataset
  • Figure 4: Distribution of Rounds and Visual Tokens across Models and Benchmarks
  • Figure 5: Ablation study on the GRPO training dataset. The comparison between multi-choice (MC) only, open-ended (OE) only, and mixed (MC+OE) data shows that mixed data provides a more effective learning environment for the agent, which leads to better performance on VideoMME.
  • ...and 7 more figures
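
The token trade-off described in the Figure 1 caption can be illustrated with a rough budget calculation. This is a sketch under stated assumptions, not a description of how EVA counts tokens: the helper `visual_tokens`, the figure of 28 x 28 pixels per visual token, and every fps, resolution, and clip-length value below are hypothetical. Only the 6600-second video duration comes from the caption.

```python
import math


def visual_tokens(duration_s, fps, width, height, pixels_per_token=28 * 28):
    """Estimate the visual tokens needed to sample a clip.

    Assumes (hypothetically) that each frame costs one token per
    `pixels_per_token` pixels, rounded up.
    """
    frames = math.ceil(duration_s * fps)
    tokens_per_frame = math.ceil(width * height / pixels_per_token)
    return frames * tokens_per_frame


VIDEO_SECONDS = 6600  # video length mentioned in the Figure 1 caption

# Pass 1 (assumed settings): skim the whole video at small frames.
overview = visual_tokens(VIDEO_SECONDS, fps=0.25, width=112, height=112)

# Pass 2 (assumed settings): inspect a 60-second key range in detail.
zoom = visual_tokens(60, fps=2.0, width=448, height=448)

# Baseline for comparison: the detailed settings applied to the whole video.
dense = visual_tokens(VIDEO_SECONDS, fps=2.0, width=448, height=448)

print(f"overview + zoom: {overview + zoom} tokens")
print(f"dense sampling:  {dense} tokens")
```

With these made-up settings the two-pass plan uses a small fraction of the tokens that dense sampling of the full video would need, which is the kind of saving the caption attributes to letting the agent choose fps and resolution per call.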