OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen

TL;DR

This work tackles the gap in human preference alignment for open-source multi-modal LLMs by introducing OmniAlign-V, a ~200K open-ended multi-modal SFT dataset generated through a tailored data synthesis pipeline, and MM-AlignBench, a high-quality, human-annotated benchmark for evaluating alignment with human values. The authors show that finetuning with OmniAlign-V via supervised fine-tuning (SFT) or direct preference optimization (DPO) significantly improves alignment with human preferences while preserving or enhancing standard VQA capabilities. A key insight is that high-quality multi-modal data, rather than solely better language data, is crucial for improving multi-modal alignment, as evidenced by ablation and benchmark results. The work provides extensive releases (data, benchmark, code, checkpoints) and highlights the need for specialized multi-modal alignment data to realize practical, human-aligned MLLMs in real-world interactions.

Abstract

Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs' alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.

Paper Structure

This paper contains 22 sections, 2 equations, 17 figures, 8 tables.

Figures (17)

  • Figure 1: Overall pipeline of OmniAlign-V. By utilizing an image filter and employing a customized pipeline for distinct tasks, we curate semantically rich images paired with high-quality open-ended question-answer sets. Post-refinement further enhances both the variety and quality of our dataset.
  • Figure 2: Data distribution of OmniAlign-V. Our dataset includes a diverse range of tasks, characterized by a more balanced distribution of answer lengths compared to those observed in ALLaVA and ShareGPT4V.
  • Figure 3: Samples in MM-AlignBench.
  • Figure 4: Examples of the limitations of current multi-modal instruction-tuning datasets.
  • Figure 5: GPT-4o shows better alignment with human preference than InternVL2-76B.
  • ...and 12 more figures