OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang

TL;DR

OmniVideo-R1 tackles modality bias in omnivideo reasoning by introducing a post-training RL framework with two stages: query-intensive grounding (QI) and modality-attentive fusion (MA). It employs GSPO to optimize sequence-level reasoning and uses self-supervised grounding alongside contrastive fusion to learn when and how to fuse audio-visual cues, without process-level annotations. A large, curated dataset drives the training, and results across multiple benchmarks show consistent improvements over strong baselines while maintaining visual-only performance. The work advances robust, interpretable omnimodal reasoning and offers a practical path toward better cross-modal integration in real-world tasks.
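The TL;DR names GSPO as the optimizer for sequence-level reasoning. As a rough illustration only, the sketch below assumes the commonly described Group Sequence Policy Optimization formulation: a length-normalized, sequence-level importance ratio combined with group-normalized advantages and clipping. The function names, the toy log-probabilities, and the clip range are hypothetical and are not taken from the paper or its code; the reward design of the QI and MA stages is not modeled here.

```python
import math

def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio.

    Equals (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|), computed from
    per-token log-probabilities of one sampled response.
    """
    assert new_logps and len(new_logps) == len(old_logps)
    mean_diff = sum(n - o for n, o in zip(new_logps, old_logps)) / len(new_logps)
    return math.exp(mean_diff)

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same query."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def gspo_objective(group_new_logps, group_old_logps, rewards, clip_eps=0.2):
    """Clipped sequence-level surrogate objective (to be maximized).

    clip_eps is an illustrative value, not a setting reported by the paper.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for new_lp, old_lp, adv in zip(group_new_logps, group_old_logps, advantages):
        ratio = sequence_ratio(new_lp, old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (min) choice between unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(rewards)

if __name__ == "__main__":
    # Two hypothetical responses to one query, with made-up token log-probs.
    old = [[-1.0, -2.0, -0.5], [-0.7, -1.2]]
    new = [[-0.9, -1.8, -0.5], [-0.8, -1.3]]
    print(gspo_objective(new, old, rewards=[1.0, 0.0]))
```

Because the ratio is computed once per response rather than per token, every token in a response shares the same clipping decision, which is the sense in which the optimization is sequence-level.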

Abstract

While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.

Paper Structure

This paper contains 29 sections, 10 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Pre-trained MLLMs (e.g., Qwen3-Omni) often exhibit suboptimal performance in audio-visual reasoning tasks due to inherent modality bias. To address this limitation, we reinforce the audio-visual reasoning ability by leveraging query intention and modality attention.
  • Figure 2: Schematic illustration of OmniVideo-R1. Using the dataset collected during data preparation, training consists of two stages: (1) the QI stage establishes query-intensive grounding behavior by aligning multiple time–caption pairs without process-level annotations; (2) the MA stage then performs modality-attentive fusion by optimizing a contrastive modality reward.
  • Figure 3: Visualization of the responses and underlying reasoning processes generated by OmniVideo-R1, Qwen3-Omni-30B-A3B-Instruct, and Qwen3-Omni-30B-A3B-Thinking for an audio-visual understanding question.
  • Figure 4: Pipeline for our data preparation consisting of 3 stages.
  • Figure 5: Visualization of the results obtained after training with QI alone and with QI+MA. Red highlights incorrect text, while green highlights correct text. Yellow highlights cases where the model overemphasizes one modality while neglecting cues from the other.
  • ...and 7 more figures