AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

Yogesh Kulkarni; Pooyan Fazli

TL;DR

The paper tackles the challenge of long-horizon multimodal video reasoning by identifying critical shortcomings of on-policy GRPO, namely data inefficiency, vanishing advantages, and uniform credit assignment. It introduces AVATAR, an off-policy reinforcement learning framework that uses a stratified replay buffer and a hinting mechanism to improve sample efficiency, paired with Temporal Advantage Shaping (TAS) to concentrate learning on early planning and late synthesis steps. A comprehensive reward suite, including R_format, R_acc, R_self, R_judge, and VCRS, alongside a three-stage training curriculum, stabilizes and guides learning for audio-visual alignment and reasoning. Empirical results across multiple audio-visual and video benchmarks show AVATAR delivering substantial gains over strong baselines, with improved data efficiency (up to 80% fewer completions) and robust performance on complex tasks like AV-Odyssey, WorldSense, and MMVU, validating its effectiveness and generality across model scales.
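As a rough illustration of what concentrating learning on early planning and late synthesis steps could look like, the sketch below scales a single sequence-level advantage by a U-shaped weight over token positions. The quadratic weight, the `boost` parameter, and the function names are assumptions made for this example only; the summary above does not specify the actual TAS formula.

```python
def temporal_weights(num_tokens, boost=0.5):
    """Hypothetical U-shaped weights: largest at the first and last positions.

    This quadratic form is an assumption for illustration, not the paper's formula.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    if num_tokens == 1:
        return [1.0 + boost]
    weights = []
    for t in range(num_tokens):
        pos = t / (num_tokens - 1)  # 0.0 at the first token, 1.0 at the last
        # (2*pos - 1)^2 is 1 at both ends and 0 in the middle of the sequence.
        weights.append(1.0 + boost * (2.0 * pos - 1.0) ** 2)
    return weights


def shape_advantage(advantage, num_tokens, boost=0.5):
    """Spread one sequence-level advantage over tokens using the weights above."""
    return [advantage * w for w in temporal_weights(num_tokens, boost)]


if __name__ == "__main__":
    # Early and late tokens receive a larger share of the learning signal.
    print(shape_advantage(1.0, num_tokens=5))
```

Under this assumed shape, tokens in the middle of a response keep the unmodified advantage, while tokens near the start and end are upweighted by up to `1 + boost`.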

Abstract

Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by $+5.4$ on MMVU, $+4.9$ on OmniBench, and $+4.5$ on Video-Holmes, while demonstrating $5\times$ sample efficiency, requiring $80\%$ fewer generated completions to reach target performance.
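The vanishing advantage problem described above can be seen in a few lines. The sketch below standardizes rewards within one group of sampled completions, which is the usual form of a group-relative advantage. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions for this example; actual implementations may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Mixed rewards: correct completions get positive advantages, wrong ones negative.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Identical rewards: every advantage is zero, so the group gives no gradient signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

When every completion in a group earns the same reward, each advantage is exactly zero and the group contributes nothing to the policy update. Reusing past experiences with more diverse rewards, as the abstract describes, is one way to avoid such degenerate groups.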
