Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

TL;DR

Omni-Reward tackles two core challenges in omni-modal reward modeling: modality imbalance and preference rigidity. It introduces Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities, and Omni-RewardData, a multimodal preference dataset with 248K general preference pairs and 69K instruction-tuning pairs. The authors develop two reward models, Omni-RewardModel-BT (discriminative, Bradley-Terry) and Omni-RewardModel-R1 (generative, trained with RL to produce chain-of-thought), and report strong performance on Omni-RewardBench and state-of-the-art results on public RM benchmarks. The work highlights the importance of mixed multimodal training data and instruction tuning for cross-modality reward alignment, with potential to improve real-world alignment of multimodal AI systems.
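
As a rough illustration of what a Bradley-Terry (discriminative) reward objective looks like, the sketch below computes the standard pairwise loss from two scalar reward scores. This is the textbook Bradley-Terry formulation, assumed here for illustration; the summary above does not spell out the exact objective used for Omni-RewardModel-BT, and the function and argument names are hypothetical.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred.

    P(chosen > rejected) = sigmoid(reward_chosen - reward_rejected).
    Both arguments are assumed to be scalar scores from a reward model.
    """
    margin = reward_chosen - reward_rejected
    # Branch on the sign of the margin so exp() never overflows.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference.

    loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    evaluated in a numerically stable way.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the chosen response is scored further above the rejected one, and equals log 2 when the two scores are tied.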

Abstract

Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.

Paper Structure

This paper contains 44 sections, 1 equation, 17 figures, and 20 tables.

Figures (17)

  • Figure 1: Illustration of nine reward modeling tasks in Omni-RewardBench.
  • Figure 2: Overview of the architecture of Omni-RewardModel.
  • Figure 3: Performance of open-source models, closed-source models, and our proposed model on the nine tasks in Omni-RewardBench, with results under w/ Tie (left) and w/o Tie (right).
  • Figure 4: Performance correlation across various tasks in Omni-RewardBench.
  • Figure 5: Construction workflow of Omni-RewardBench.
  • ...and 12 more figures