EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation

Yongxin Wang, Meng Cao, Haokun Lin, Mingfei Han, Liang Ma, Jin Jiang, Yuhao Cheng, Xiaodan Liang

TL;DR

MLLMs still suffer from hallucinations and reasoning errors, and high-quality preference data is costly to obtain. EACO addresses this by training a dedicated Critic on a large critic dataset to score self-generated responses and guide refined Direct Preference Optimization, followed by enhanced supervised fine-tuning, using only 5k images for preference data. The approach yields significant reductions in hallucinations and notable gains in reasoning across multiple benchmarks, and scales across open-source backbones. This critic-based, data-efficient framework offers a practical path to improve multimodal alignment and reasoning in diverse models, with strong potential for open-source adoption.

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress on various visual question answering and reasoning tasks by leveraging instruction fine-tuning on specific datasets. They can also learn from human-annotated preference data to enhance their reasoning ability and mitigate hallucinations. Most preference data is generated by the model itself. However, existing methods require high-quality critical labels, which are costly and rely on humans or proprietary models like GPT-4V. In this work, we propose Enhancing Alignment in MLLMs via Critical Observation (EACO), which economically aligns MLLMs with self-generated preference data using only 5k images. Our approach begins with collecting and refining a Scoring Evaluation Instruction-tuning dataset to train a critical evaluation model, termed the Critic. This Critic observes model responses across multiple dimensions, selecting preferred and non-preferred outputs for refined Direct Preference Optimization (DPO) tuning. To further enhance model performance, we employ an additional supervised fine-tuning stage after preference tuning. EACO reduces overall hallucinations by 65.6% on HallusionBench and improves reasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement over LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also reveals the potential critical ability of open-source MLLMs, demonstrating that EACO is a viable path to boost the competence of MLLMs.
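The selection step described in the abstract (sample several responses, score them with the Critic, keep a preferred and a non-preferred one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `PreferencePair`, `build_preference_pair`, `critic_score`, and the `min_gap` threshold are hypothetical names, and pairing the highest-scored response with the lowest-scored one is an assumed selection rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PreferencePair:
    """One DPO training example: a question with a chosen and a rejected response."""
    question: str
    chosen: str
    rejected: str


def build_preference_pair(
    question: str,
    responses: List[str],
    critic_score: Callable[[str, str], float],  # hypothetical Critic interface
    min_gap: float = 1.0,                       # hypothetical threshold
) -> Optional[PreferencePair]:
    """Score every response and pair the best with the worst.

    Returns None when there are fewer than two responses or when the
    score gap is too small to give a clear preference signal.
    """
    if len(responses) < 2:
        return None
    scored = [(critic_score(question, r), r) for r in responses]
    best_score, best = max(scored, key=lambda s: s[0])
    worst_score, worst = min(scored, key=lambda s: s[0])
    if best_score - worst_score < min_gap:
        return None
    return PreferencePair(question=question, chosen=best, rejected=worst)
```

In practice `critic_score` would wrap a call to the trained Critic model on the image-question pair; here it is left as a plain callable so the selection logic can be read and tested in isolation.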

Paper Structure

This paper contains 24 sections, 3 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: Upper: Response examples from the original LLaVA-v1.6 7B [liu2023improvedllava] and LLaVA-v1.6 7B w/ EACO; the latter mitigates hallucination and improves reasoning ability. Lower Left: The framework of EACO. The process begins with an image-question pair, which is fed to the initialized MLLM to generate multiple responses. These responses are then evaluated by a Critic Model that provides judgments regarding their quality. Based on the critic's analysis, the responses are categorized into preference and non-preference groups. Finally, the preference data is collected to further improve the MLLM. Lower Right: Comparison with LLaVA-v1.6 7B [liu2023improvedllava], LLaVA-RLHF [llava_rlhf], Silkie [2023vlfeedback], SIMA [sima], and STIC [stic]. Our proposed EACO framework achieves improvements across multiple metrics, demonstrating robust performance gains compared to other methods.
  • Figure 2: Various datasets used for scoring evaluation. Each dataset contains a specific number of instructions, with a total of 51,000 instructions dedicated to visual tasks and 137,486 for scoring evaluations. Because some instructions contain responses with only a small score gap, we filter those responses out and retain the ones with a larger score gap.
  • Figure 3: Critic Model training pipeline. After filtering the critic data, we combine the question, responses, and rating score to construct the refined critic dataset. Here, we adopt low-rank adaptation (LoRA) [hu2021lora] to fine-tune the Critic Model.
  • Figure 4: Comparison of preferred and non-preferred responses generated for two visual content summarization examples. The left panel shows a group of elephants in a grassy enclosure, with the preferred response accurately describing the positioning and type of each elephant. The non-preferred response incorrectly describes the arrangement and adds speculative details about the weather. The right panel depicts luggage and personal items in a public waiting area with bicycles in the background. The preferred response correctly identifies key elements and setting details, while the non-preferred response includes inaccurate object descriptions and overlooks the "door".
  • Figure 5: Ablation Studies on Preference Dataset Scaling, Critic Prompt Design, and Iterative Alignment Tuning. Left: The impact of scaling up the preference dataset is shown, where expanding from 5k to 15k samples improves model performance. The figure also compares three different critic prompts: Rating Prompt (the first blue point), Additive Prompt (red star), and Subtractive Prompt (green star). Right: The impact of iterative alignment on model performance, with average scores.
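
The score-gap filtering mentioned in the Figure 2 caption can be illustrated with a short sketch. The record layout (`score_a`, `score_b`), the function name `filter_by_score_gap`, and the `min_gap` value are assumptions made for illustration; the caption does not state the threshold or data schema actually used.

```python
from typing import Dict, List, Union

Record = Dict[str, Union[str, float]]


def filter_by_score_gap(records: List[Record], min_gap: float = 2.0) -> List[Record]:
    """Keep only critic examples whose two responses have clearly different scores.

    Each record is assumed (hypothetically) to hold two rated responses
    under the keys "score_a" and "score_b".
    """
    return [
        r for r in records
        if abs(float(r["score_a"]) - float(r["score_b"])) >= min_gap
    ]


# Toy usage: the second record is dropped because its scores are too close.
examples = [
    {"question": "What is in the image?", "score_a": 9.0, "score_b": 3.0},
    {"question": "How many dogs are there?", "score_a": 6.0, "score_b": 5.5},
]
kept = filter_by_score_gap(examples)
```

Dropping near-tied pairs is one simple way to keep only examples that give the Critic an unambiguous training signal.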