Joint Reward Modeling: Internalizing Chain-of-Thought for Efficient Visual Reward Models

Yankai Yang, Yancheng Long, Hongyang Wei, Wei Chen, Tianke Zhang, Kaiyu Jiang, Haonan Fan, Changyi Liu, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang

TL;DR

The paper tackles the trade-off between semantic understanding and inference efficiency in reward models for RLHF on multimodal tasks. It introduces Joint Reward Modeling (JRM), which jointly trains a shared vision–language backbone on preference ranking and language modeling to internalize chain-of-thought in latent representations while discarding the language generation path at inference. Empirically, JRM achieves state-of-the-art results on EditReward-Bench and MMRB2, and yields significant gains in downstream Flow-GRPO online RL, backed by representation analyses showing higher effective rank and reduced collapse. The findings demonstrate that discriminative reward models can inherit deep reasoning capabilities from generative models without incurring inference costs, enabling scalable and efficient multimodal alignment.
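The effective-rank claim can be made concrete with a small sketch. This summary does not give the paper's exact definition, so the code below assumes the common entropy-based one: the exponential of the Shannon entropy of the normalized singular values of a feature matrix. The function name `effective_rank` and the choice to pass singular values directly are illustrative, not taken from the paper.

```python
import math

def effective_rank(singular_values):
    """Entropy-based effective rank (assumed definition, not confirmed by the source).

    singular_values: non-negative singular values of a feature matrix.
    Returns exp(H), where H is the Shannon entropy of the normalized spectrum.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("at least one singular value must be positive")
    # Normalize the spectrum into a probability distribution, skipping zeros
    # because 0 * log(0) is taken as 0.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition a flat spectrum of four equal singular values has effective rank 4, while a spectrum with a single non-zero value has effective rank 1, which is the collapsed case the representation analysis is meant to detect.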

Abstract

Reward models are critical for reinforcement learning from human feedback, as they determine the alignment quality and reliability of generative models. For complex tasks such as image editing, reward models are required to capture global semantic consistency and implicit logical constraints beyond local similarity. Existing reward modeling approaches have clear limitations. Discriminative reward models align well with human preferences but struggle with complex semantics due to limited reasoning supervision. Generative reward models offer stronger semantic understanding and reasoning, but they are costly at inference time and difficult to align directly with human preferences. To this end, we propose Joint Reward Modeling (JRM), which jointly optimizes preference learning and language modeling on a shared vision-language backbone. This approach internalizes the semantic and reasoning capabilities of generative models into efficient discriminative representations, enabling fast and accurate evaluation. JRM achieves state-of-the-art results on MMRB2 and EditReward-Bench, and significantly improves stability and performance in downstream online reinforcement learning. These results show that joint training effectively bridges efficiency and semantic understanding in reward modeling.
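As a rough illustration of the joint objective, the sketch below combines a pairwise preference loss with a token-level language-modeling loss, weighted by a language supervision weight `alpha`. The abstract does not spell out the loss, so several things here are assumptions: the Bradley-Terry form of the ranking term, the additive combination `rank + alpha * lm`, and all function names. Only the existence of the weight and the chosen setting `alpha = 0.7` come from the figure captions.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise ranking loss -log(sigmoid(r_chosen - r_rejected)) (assumed form)."""
    diff = r_chosen - r_rejected
    # Numerically stable softplus(-diff).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def token_cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of target tokens given per-token logits."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        peak = max(row)
        # log-sum-exp with the maximum subtracted for stability.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total / len(target_ids)

def joint_loss(r_chosen, r_rejected, logits, target_ids, alpha=0.7):
    """Hypothetical joint objective: ranking loss plus alpha times LM loss."""
    rank = bradley_terry_loss(r_chosen, r_rejected)
    lm = token_cross_entropy(logits, target_ids)
    return rank + alpha * lm, rank, lm
```

In a real model both terms would be computed from the same shared backbone, with the scalar rewards coming from a discriminative head and the logits from the language-modeling head. At inference only the reward head is evaluated, which is what makes the language path discardable.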

Paper Structure (26 sections, 10 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: Performance comparison. Left: Comparison of Generative, Discriminative, and Joint Reward Modeling (JRM) paradigms. Right: JRM achieves state-of-the-art accuracy on benchmarks and significantly boosts downstream RL performance. Note: in the performance comparison, Generative RM refers to EditScore, and Discriminative RM refers to EditReward.
  • Figure 2: Illustration of the JRM Framework. The workflow consists of three stages: (1) Joint Training, where the model internalizes reasoning capabilities; (2) Efficient Inference, which retains only the discriminative pathway; and (3) the Online RL Loop, where JRM provides scalable feedback for downstream alignment.
  • Figure 3: Attention visualization. Compared to the baseline (a discriminative reward model without semantic supervision), JRM accurately focuses on the salient regions specified by the editing instructions.
  • Figure 4: Impact of different language supervision weights $\alpha$ on model performance. As $\alpha$ increases, model performance steadily improves on both benchmarks. Note: except for $\alpha=0.7$ (the chosen setting for JRM), the optimal checkpoints differ between the two benchmarks; results shown represent the best performance achieved on each benchmark respectively.
  • Figure 5: Training dynamics of component losses. Left: Cross-entropy loss (language supervision). Right: Ranking loss under different $\alpha$ values.
  • ...and 11 more figures