Automated Multi-level Preference for MLLMs

Mengxi Zhang; Wenhao Wu; Yu Lu; Yuxin Song; Kang Rong; Huanjin Yao; Jianbo Zhao; Fanglong Liu; Yifan Sun; Haocheng Feng; Jingdong Wang

TL;DR

The paper tackles hallucinations in multimodal LLMs by replacing binary human feedback with automated multi-level preferences. It introduces AMP, which pairs a human-free data generation pipeline (Multi-size Expert Generation, or MEG, and Incremental Generation), refined by an auto-check mechanism to yield reliable $K$-level preference datasets, with the MDPO learning objective. A new benchmark, MRHal-Bench, assesses hallucinations in multi-round dialogues, and the reported experiments show AMP surpassing general MLLMs and RLHF-based baselines across multiple benchmarks. The work offers a scalable way to ground MLLMs while reducing annotation overhead, and releases open-source code for replication.

Abstract

Current multimodal Large Language Models (MLLMs) suffer from "hallucination", occasionally generating responses that are not grounded in the input images. To tackle this challenge, one promising path is to utilize reinforcement learning from human feedback (RLHF), which steers MLLMs towards learning superior responses while avoiding inferior ones. We rethink the common practice of using binary preferences (i.e., superior, inferior), and find that adopting multi-level preferences (e.g., superior, medium, inferior) offers two benefits: 1) It narrows the gap between adjacent levels, thereby encouraging MLLMs to discern subtle differences. 2) It further integrates cross-level comparisons (beyond adjacent-level comparisons), thus providing a broader range of comparisons with hallucination examples. To verify our viewpoint, we present the Automated Multi-level Preference (AMP) framework for MLLMs. To facilitate this framework, we first develop an automated dataset generation pipeline that provides high-quality multi-level preference datasets without any human annotators. Furthermore, we design the Multi-level Direct Preference Optimization (MDPO) algorithm to robustly conduct complex multi-level preference learning. Additionally, we propose a new hallucination benchmark, MRHal-Bench. Extensive experiments across public hallucination and general benchmarks, as well as our MRHal-Bench, demonstrate the effectiveness of our proposed method. Code is available at https://github.com/takomc/amp.
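The two benefits named in the abstract can be made concrete with a small sketch. It enumerates adjacent-level and cross-level comparisons for K ranked responses, then averages the standard pairwise DPO loss over them. This is an illustration under stated assumptions, not the paper's MDPO objective, which is not reproduced on this page: the function names, the uniform averaging over pairs, and the value of `beta` are all hypothetical choices.

```python
import math
from itertools import combinations


def preference_pairs(ranked, include_cross_level=True):
    """Return (better, worse) pairs from items ordered best to worst.

    Adjacent-only gives K-1 pairs; adding cross-level pairs gives K*(K-1)/2.
    """
    ranked = list(ranked)
    if include_cross_level:
        return list(combinations(ranked, 2))
    return list(zip(ranked, ranked[1:]))


def dpo_pair_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities.

    Computes -log(sigmoid(margin)) as a numerically stable softplus(-margin).
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def multi_level_loss(policy_logps, ref_logps, beta=0.1, include_cross_level=True):
    """Average the pairwise loss over all comparisons (an assumed aggregation).

    Both inputs hold one log-probability per response, ordered best to worst.
    """
    if len(policy_logps) != len(ref_logps) or len(policy_logps) < 2:
        raise ValueError("need matching lists with at least two responses")
    pairs = preference_pairs(range(len(policy_logps)), include_cross_level)
    losses = [
        dpo_pair_loss(policy_logps[w], policy_logps[l], ref_logps[w], ref_logps[l], beta)
        for w, l in pairs
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Three levels (superior, medium, inferior) with made-up log-probabilities.
    print(preference_pairs(["A", "B", "C"], include_cross_level=False))
    print(preference_pairs(["A", "B", "C"]))
    print(multi_level_loss([-12.0, -14.5, -20.0], [-13.0, -14.0, -18.0]))
```

With three levels, the adjacent-only setting yields the two comparisons A over B and B over C, while the cross-level setting adds A over C, which is the extra comparison against a more heavily hallucinated response that the abstract describes.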

Paper Structure (28 sections, 8 equations, 6 figures, 6 tables, 1 algorithm)

Introduction
Related Work
  Multimodal Large Language Models
  Hallucinations in MLLMs
Methods
  Human-free Multi-level Preference Dataset Generation
    Multi-size Expert Generation
    Incremental Generation
    Auto-check Mechanism
  Multi-level Direct Preference Optimization (MDPO)
    Preliminary
    Learning Objective of MDPO Algorithm
Experiments and Analysis
  Implementation Details
  Evaluation Benchmarks
...and 13 more sections

Figures (6)

Figure 1: Left: The input image, text prompt, and corresponding multi-level preference dataset. Contents highlighted in red signify hallucinations. Responses range from A to C, representing varying degrees of quality from superior to inferior. Right: The strategy for leveraging inferior responses. (a) The conventional RLHF baseline, which adopts binary preferences. (b) To narrow the gap between adjacent levels, we first split a single comparison into multiple comparisons by inserting extra medium responses. (c) Furthermore, we introduce cross-level comparisons to augment the dataset with more hallucination examples.
Figure 2: Pipeline for Constructing Human-free Multi-level Preference Dataset. We initiate the process with Multi-size Expert Generation and Incremental Generation to establish the initial dataset. Then, to enhance the quality of the initial preference dataset, we introduce the Auto-check Mechanism, which calculates both global and local metrics based on sentences and noun chunks, respectively.
Figure 3: Case studies including our AMP-MEG, LLaVA-V1.5, and LLaVA-RLHF. Hallucinations and correct responses are highlighted in different colors. Please zoom in for the best view.
Figure 4: Omitted nouns in the auto-check mechanism.
Figure 5: The text prompt given to GPT-4V in the annotation process for multi-round dialogues.
...and 1 more figure
