AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

Yogesh Kulkarni, Pooyan Fazli

TL;DR

The paper tackles long-horizon multimodal video reasoning by identifying three shortcomings of on-policy GRPO: data inefficiency, vanishing advantages, and uniform credit assignment. It introduces AVATAR, an off-policy reinforcement learning framework that uses a stratified replay buffer and a hinting mechanism to improve sample efficiency, paired with Temporal Advantage Shaping (TAS) to concentrate learning on early planning and late synthesis steps. A reward suite (R_format, R_acc, R_self, R_judge, and VCRS) and a three-stage training curriculum stabilize and guide learning for audio-visual alignment and reasoning. Across multiple audio-visual and video benchmarks, including AV-Odyssey, WorldSense, and MMVU, AVATAR improves over strong baselines (for example, +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes over Qwen2.5-Omni) while requiring 80% fewer generated completions to reach target performance, and the gains hold across model scales.
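
As a rough illustration of the off-policy idea, the sketch below shows one way a stratified replay buffer could be organized. It is not the paper's implementation: this summary does not say what the buffer is stratified by, so the stratum key, the class name `StratifiedReplayBuffer`, the per-stratum capacity, and the round-robin sampling rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict, deque


class StratifiedReplayBuffer:
    """Hypothetical buffer that keeps past rollouts in separate strata."""

    def __init__(self, capacity_per_stratum=256, seed=0):
        # Each stratum is a bounded FIFO queue; the oldest rollouts are evicted first.
        self._strata = defaultdict(lambda: deque(maxlen=capacity_per_stratum))
        self._rng = random.Random(seed)

    def add(self, stratum, rollout):
        """Store a rollout under a caller-chosen stratum key (an assumption)."""
        self._strata[stratum].append(rollout)

    def sample(self, batch_size):
        """Draw a batch by cycling over non-empty strata in round-robin order."""
        keys = [k for k, items in self._strata.items() if items]
        if not keys:
            return []
        batch = []
        for i in range(batch_size):
            key = keys[i % len(keys)]
            batch.append(self._rng.choice(self._strata[key]))
        return batch


buffer = StratifiedReplayBuffer()
for _ in range(10):
    buffer.add("low_reward", {"reward": 0.0})
buffer.add("high_reward", {"reward": 1.0})

# Even though high-reward rollouts are rare, half of the batch comes from that stratum.
batch = buffer.sample(4)
```

Sampling evenly across strata, rather than uniformly over all stored rollouts, is one simple way to keep reward diversity in each training group, which is the property the summary attributes to reusing past experiences.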

Abstract

Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating 5× sample efficiency, requiring 80% fewer generated completions to reach target performance.
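
The vanishing advantage problem in limitation (2) can be made concrete with a small sketch. It assumes the commonly used GRPO normalization, in which each reward in a group is centered on the group mean and divided by the group standard deviation; the function name and the `eps` stabilizer are illustrative choices, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    eps is an assumed stabilizer that avoids division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A group where every completion gets the same reward yields all-zero
# advantages, so the policy gradient for that group carries no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When a prompt is too easy or too hard for the current policy, every completion in its group tends to receive the same reward, which is exactly the `uniform` case above.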

Paper Structure

This paper contains 48 sections, 11 equations, 15 figures, 10 tables, 1 algorithm.

Figures (15)

  • Figure 1: Standard GRPO (top) vs. AVATAR (bottom). AVATAR enhances GRPO with two key components: (1) an off-policy architecture using a stratified replay buffer to improve data efficiency, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that focuses learning on critical reasoning steps.
  • Figure 2: To address GRPO’s uniform credit assignment (gray line), TAS applies a parabolic weighting function to amplify advantages during crucial planning and synthesis stages.
  • Figure 3: Three-stage RL training pipeline to evaluate AVATAR. The framework advances from cold-start SFT (Stage 0) through Visual Reasoning (Stage 1) and Audio-Visual Reasoning (Stage 2) to Audio-Object Localization (Stage 3).
  • Figure 4: Comparison of training dynamics between GRPO and AVATAR. GRPO (a) shows oscillatory and unstable accuracy reward progression, whereas AVATAR (b) demonstrates a smoother, more consistent learning trajectory.
  • Figure 5: Effect of reasoning sequence length on TAS performance. In both (a) audio-visual and (b) video reasoning benchmarks, TAS yields greater gains with longer reasoning sequences.
  • ...and 10 more figures
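
The parabolic weighting described in the Figure 2 caption can be sketched as follows. The exact functional form and coefficients are not given in this summary, so the U-shaped parabola over normalized token position, the strength parameter `lam`, and the function names are all assumptions chosen only to show the shape of the idea.

```python
def tas_weights(seq_len, lam=0.5):
    """Hypothetical U-shaped weights over token positions.

    Weight is 1 + lam at the first and last token and 1 at the midpoint,
    so early (planning) and late (synthesis) tokens are amplified.
    """
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    if seq_len == 1:
        return [1.0]  # a single token has no interior; leave it unweighted
    weights = []
    for t in range(seq_len):
        pos = 2.0 * t / (seq_len - 1) - 1.0  # map position to [-1, 1]
        weights.append(1.0 + lam * pos ** 2)
    return weights


def shape_advantage(advantage, seq_len, lam=0.5):
    """Spread one sequence-level advantage over tokens using the weights."""
    return [advantage * w for w in tas_weights(seq_len, lam)]


# Uniform credit assignment would give every token the same advantage;
# here the ends of the sequence receive a larger share than the middle.
shaped = shape_advantage(advantage=2.0, seq_len=5)
```

Setting `lam=0` recovers the uniform assignment that the caption attributes to standard GRPO, which makes the shaping strength easy to ablate in this sketch.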