AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Sanjoy Chowdhury, Karren D. Yang, Xudong Liu, Fartash Faghri, Pavan Kumar Anasosalu Vasu, Oncel Tuzel, Dinesh Manocha, Chun-Liang Li, Raviteja Vemulapalli

TL;DR

The paper addresses the challenge of agentic, multi-speaker understanding in multimodal video data by introducing AMusE, a benchmark capturing six audio-visual reasoning tasks across zero-shot, guided, and agentic modes. It then presents RAFT, a data-efficient Reasoning–Acting–Feedback training framework that combines Reflective Reward Optimization with Selective Reasoning Adaptation to improve cross-modal planning, grounding, and temporal coherence. Empirical results show substantial gains, including up to 39.52% relative accuracy improvement and notable BLEU/METEOR/CIDEr improvements, with open-source models benefiting significantly and approaching closed-source performance under RAFT. Together, AMusE and RAFT provide a practical platform to study and improve agentic multimodal reasoning for real-world multi-speaker scenarios.

Abstract

Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning: tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes (zero-shot, guided, and agentic) and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization, using intrinsic multimodal self-evaluation as the reward, with selective parameter adaptation for data- and parameter-efficient updates. Using RAFT, we achieve up to 39.52% relative improvement in accuracy on our benchmark. Together, AMUSE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities.
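To make the "relative improvement" figure concrete, the following minimal sketch shows how such a percentage is conventionally computed from a baseline accuracy and an improved accuracy. The function name and the accuracy values are hypothetical and chosen only for illustration; they are not results from the paper, and the paper's exact evaluation code is not shown here.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical accuracies (in percent), NOT numbers reported in the paper.
baseline_acc = 40.0
aligned_acc = 50.0

# An absolute gain of 10 points on a baseline of 40 is a 25% relative gain.
print(f"{relative_improvement(baseline_acc, aligned_acc):.2f}%")  # 25.00%
```

Note that a relative improvement depends on the baseline: the same absolute gain yields a larger relative figure when the starting accuracy is low.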

Paper Structure

This paper contains 50 sections, 11 equations, 15 figures, 27 tables.

Figures (15)

  • Figure 1: AMusE task definition. The benchmark includes six high-level audio-visual tasks in realistic multi-speaker settings. Each task requires integrating core skills, including spatial and temporal grounding, speaker identification, speech recognition, and summarization. A marker in the figure denotes a textual description of the audio scene, included for reader clarity.
  • Figure 2: Evaluation protocols. Zero-Shot, Guided, and Agentic modes, in which MLLMs respectively reason over raw input, use auxiliary cues (e.g., faces, transcripts), or invoke external tools (e.g., Whisper, Pyannote, InsightFace).
  • Figure 3: RAFT framework for agentic multimodal reasoning. Given a dialogue-rich video, the model uses perception tools to extract multimodal cues. RAFT integrates SRA and RRO within a Reason–Act–Feedback loop, using perceptual consistency to refine temporal and speaker-grounded responses. The RAFT module operates only during training. The dotted arrow indicates that RRO passively uses perceptual feedback for reward computation rather than actively controlling the tools.
  • Figure 4: Qualitative results. Comparison on multi-speaker reasoning tasks: Next-Speaker Prediction (left), Speaker Association (middle), and Temporal Grounding (right). UI: Unified-IO2, CR: CREMA, VS: VideoSALMONN, VI: VITA, Q2.5: Qwen2.5-Omni, and Q3: Qwen3-Omni under Zero-Shot, Agentic w/o RAFT, and Agentic w/ RAFT modes.
  • Figure 5: Comparison of optimization methods. Agentic performance across models on the STG task.
  • ...and 10 more figures