Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment

Radha Gulhane, Sathish Reddy Indurthi

TL;DR

HARMO introduces a hybrid and multi-aspect reward framework for aligning Multimodal Large Language Models, combining rule-based verifiable rewards with model-based signals and behavioral constraints (including a length penalty and format adherence) to overcome monolithic reward limitations. An embedding-based surrogate RM reduces annotation overhead, and a critic-free GRPO-based optimization stabilizes training. Empirical results across math, VQA, and OCR benchmarks show consistent gains, notably a ~9.5% average improvement and ~16% in mathematics for 3B-scale models, with strong performance rivaling larger proprietary systems. This work highlights the value of a diversified reward portfolio for robust MLLM alignment and suggests avenues for dynamic reward weighting and self-improving reward mechanisms.
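
The "critic-free" property mentioned above comes from how GRPO scores a response: each sampled response is compared against the other responses drawn for the same prompt, so no learned value network is needed. The following is a minimal sketch of that group-relative normalization, not the authors' implementation. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (no critic).

    `rewards` holds one scalar per response sampled for a single prompt.
    `eps` (assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one prompt: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group mean receive a positive advantage and the rest a negative one. When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.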

Abstract

Aligning multimodal large language models (MLLMs) with human preferences typically relies on single-signal, model-based reward methods. Such monolithic rewards often lack confidence calibration across domain-specific tasks, fail to capture diverse aspects of human preferences, and require extensive data annotation and reward model training. In this work, we propose a hybrid reward modeling framework that integrates complementary reward paradigms: (i) model-based rewards, where a learned reward model predicts scalar or vector scores from synthetic and human feedback, and (ii) rule-based rewards, where domain-specific heuristics provide explicit, high-confidence correctness signals. Beyond accuracy, we further incorporate multi-aspect rewards to enforce instruction adherence and introduce a generalized length-penalty reward to stabilize training and improve performance. The proposed framework provides a flexible and effective approach to aligning MLLMs through reinforcement learning policy optimization. Our experiments show consistent improvements across different multimodal benchmarks when applying hybrid and multi-aspect reward modeling. Our best-performing model in the 3B family achieves an overall average improvement of ~9.5% across general and math reasoning tasks. On mathematical benchmarks specifically, the model achieves an average improvement of ~16%, highlighting its effectiveness in mathematical reasoning and problem solving.
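
To make the combination of reward signals concrete, the sketch below composes a rule-based accuracy check, a model-based fallback score, a format-adherence term, and a length penalty into one scalar. It is an illustration under stated assumptions, not the paper's formulation: the `\boxed{...}` answer convention, the `<think>` tags, the token thresholds, the weights, and all function names are hypothetical.

```python
import re


def rule_based_accuracy(response, reference):
    """Verifiable check: compare the text inside \\boxed{...} to the reference.

    The \\boxed{} answer convention is an assumption of this sketch.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == reference.strip() else 0.0


def format_reward(response):
    """Instruction adherence: 1.0 if reasoning is wrapped in <think> tags (assumed format)."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0


def length_penalty(num_tokens, min_tokens=32, max_tokens=512):
    """Hypothetical penalty in [-1, 0]: zero inside the band, negative outside it."""
    if num_tokens < min_tokens:
        return -(min_tokens - num_tokens) / min_tokens
    if num_tokens > max_tokens:
        return -min(1.0, (num_tokens - max_tokens) / max_tokens)
    return 0.0


def hybrid_reward(response, num_tokens, reference=None, model_score=0.0,
                  w_acc=1.0, w_fmt=0.5, w_len=0.5):
    """Use the rule-based check when a reference exists, else the model-based score.

    `model_score` stands in for a learned reward model's output; the weights
    are illustrative, not values from the paper.
    """
    if reference is not None:
        accuracy = rule_based_accuracy(response, reference)
    else:
        accuracy = model_score
    return (w_acc * accuracy
            + w_fmt * format_reward(response)
            + w_len * length_penalty(num_tokens))


good = hybrid_reward(r"<think>2+2</think> \boxed{4}", num_tokens=40, reference="4")
short = hybrid_reward("5", num_tokens=8, reference="4")
```

Under these assumed weights, a correct and well-formatted response of acceptable length scores highest, while a very short, incorrect response is pushed below zero by the length term. This mirrors the brevity bias that the length penalty is meant to counteract.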

Paper Structure

This paper contains 30 sections, 5 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of response lengths between the SFT baseline and the RL-aligned model (without a length penalty). The RL policy learns a brevity bias, producing shorter and often incomplete responses.
  • Figure 2: Training dynamics with and without the proposed length penalty. (a) The accuracy reward consistently improves. (b) The length penalty successfully counteracts the model's tendency to produce shorter responses, promoting more stable and desirable output lengths.
  • Figure 3: Case Study 1 - Math Cube Problem
  • Figure 4: Case Study 2 - Solving a General Math Problem
  • Figure 5: Case Study 3 - Math Puzzle Problem
  • ...and 2 more figures