PerPO: Perceptual Preference Optimization via Discriminative Rewarding

Zining Zhu, Liang Zhao, Kangheng Lin, Jinze Yang, En Yu, Chenglong Liu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang

TL;DR

PerPO addresses the gap between generative prowess and visual discrimination in multimodal LLMs by using discriminative rewards to generate diverse negative samples and a listwise ranking objective to align outputs with human perceptual criteria. It unifies discriminative empirical risk minimization with generative objectives, mitigating image-unconditional reward hacking and boosting performance on object grounding, dense OCR, and general image understanding across multiple baselines. The results show consistent gains, robustness to visual input, and better alignment with human judgments, suggesting PerPO as a practical pathway toward perceptually aligned MLLMs. The work also emphasizes data efficiency, margin design, and the importance of negative sample diversity for scalable perceptual alignment.

Abstract

This paper presents Perceptual Preference Optimization (PerPO), a perception alignment method aimed at addressing the visual discrimination challenges in generative pre-trained multimodal large language models (MLLMs). To align MLLMs with the human visual perception process, PerPO employs discriminative rewarding to gather diverse negative samples, followed by listwise preference optimization to rank them. By utilizing the reward as a quantitative margin for ranking, our method effectively bridges generative preference optimization and discriminative empirical risk minimization. PerPO significantly enhances MLLMs' visual discrimination capabilities while maintaining their generative strengths, mitigates image-unconditional reward hacking, and ensures consistent performance across visual tasks. This work marks a crucial step towards more perceptually aligned and versatile MLLMs. We also hope that PerPO will encourage the community to rethink MLLM alignment strategies.
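
The abstract describes two steps: scoring sampled responses with a discriminative reward, then ranking them listwise with the reward gap acting as a margin. The sketch below illustrates that idea only. It is not the paper's objective, which is not reproduced in this section. It assumes IoU as the reward for a grounding task and a pairwise logistic loss over the list, weighted by reward difference. The names `iou`, `log_sigmoid`, and `listwise_margin_loss` are hypothetical, as is the default `beta` value.

```python
import math


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes.

    Assumption: used here as the discriminative reward for grounding.
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def listwise_margin_loss(rewards, policy_logps, ref_logps, beta=0.1):
    """Reward-weighted pairwise logistic loss over a list of responses.

    Assumption: each pair (i, j) with rewards[i] > rewards[j] contributes a
    DPO-style term weighted by the reward gap. This is an illustrative
    stand-in, not the loss defined in the paper.
    """
    # Implicit score of each response: scaled log-ratio to the reference model.
    scores = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    total, weight_sum = 0.0, 0.0
    for i in range(len(rewards)):
        for j in range(len(rewards)):
            margin = rewards[i] - rewards[j]
            if margin > 0:
                total += -margin * log_sigmoid(scores[i] - scores[j])
                weight_sum += margin
    # Normalize by total margin so the loss scale does not grow with list size.
    return total / weight_sum if weight_sum > 0 else 0.0
```

In this sketch, a pair with a large reward gap is penalized more when the policy ranks it in the wrong order, which is one simple way to let the reward serve as a quantitative margin.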

Paper Structure

This paper contains 18 sections, 9 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: (a) Examples of visual generative and discriminative tasks. (b) Performance comparison on RefCOCOg with increasing list size for SFT, DPO, PerPO, and Best-of-N. (c) Performance comparison of PerPO and DPO with and without image input across different benchmarks. Notably, PerPO shows a greater performance gap, highlighting a strong reliance on image conditioning.
  • Figure 2: Analysis of training data quality, quantity, and the hyperparameter $\beta$. (a) Performance across different data margins. (b) Performance across different data sizes. (c) Performance across different $\beta$ values in the loss function.
  • Figure 3: Relative performance (left, with human users as judges) and comparative showcases (right) with and without PerPO alignment across different tasks.
  • Figure 4: Performance of different $\gamma$ values in PerPO loss.
  • Figure 5: Comparison of PerPO and SFT across different dense OCR levels.
  • ...and 1 more figure