PerPO: Perceptual Preference Optimization via Discriminative Rewarding
Zining Zhu, Liang Zhao, Kangheng Lin, Jinze Yang, En Yu, Chenglong Liu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang
TL;DR
PerPO addresses the gap between generative prowess and visual discrimination in multimodal LLMs by using discriminative rewards to generate diverse negative samples and a listwise ranking objective to align outputs with human perceptual criteria. It unifies discriminative empirical risk minimization with generative objectives, mitigating image-unconditional reward hacking and boosting performance on object grounding, dense OCR, and general image understanding across multiple baselines. The results show consistent gains, robustness to visual input, and better alignment with human judgments, suggesting PerPO as a practical pathway toward perceptually aligned MLLMs. The work also emphasizes data efficiency, margin design, and the importance of negative sample diversity for scalable perceptual alignment.
Abstract
This paper presents Perceptual Preference Optimization (PerPO), a perception alignment method aimed at addressing the visual discrimination challenges in generative pre-trained multimodal large language models (MLLMs). To align MLLMs with the human visual perception process, PerPO employs discriminative rewarding to gather diverse negative samples, followed by listwise preference optimization to rank them. By utilizing the reward as a quantitative margin for ranking, our method effectively bridges generative preference optimization and discriminative empirical risk minimization. PerPO significantly enhances MLLMs' visual discrimination capabilities while maintaining their generative strengths, mitigates image-unconditional reward hacking, and ensures consistent performance across visual tasks. This work marks a crucial step towards more perceptually aligned and versatile MLLMs. We also hope that PerPO will encourage the community to rethink MLLM alignment strategies.
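
To make the idea of "reward as a quantitative margin for ranking" concrete, the following is a minimal sketch, not the paper's exact objective. It assumes an object-grounding setting in which the discriminative reward is the IoU between a predicted box and the ground-truth box, and it assumes a pairwise-decomposed listwise loss in which each pair of sampled responses is weighted by its reward gap. The function names, the `beta` parameter, and the use of a scalar preference score per response (for example, a policy-versus-reference log-likelihood ratio) are all illustrative assumptions.

```python
import math


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes.

    Used here as an assumed discriminative reward for object grounding.
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_weighted_listwise_loss(scores, rewards, beta=0.1):
    """Hypothetical listwise loss: every ordered pair (i, j) with
    rewards[i] > rewards[j] contributes a logistic ranking term
    weighted by the reward gap, so larger perceptual differences
    push the scores apart more strongly.
    """
    total, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            margin = rewards[i] - rewards[j]
            if margin > 0:
                total -= margin * log_sigmoid(beta * (scores[i] - scores[j]))
                pairs += 1
    # With no strictly ordered pair there is nothing to rank.
    return total / pairs if pairs else 0.0


# Usage: rank three sampled boxes against a ground-truth box.
gt = (0, 0, 2, 2)
samples = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]
rewards = [iou(s, gt) for s in samples]
loss = margin_weighted_listwise_loss([2.0, 1.0, 0.0], rewards, beta=1.0)
```

Under these assumptions, scores that agree with the reward ordering yield a lower loss than scores that reverse it, and the penalty for a misordered pair grows with the reward gap between its two responses.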
