MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Run Luo, Haonan Zhang, Longze Chen, Ting-En Lin, Xiong Liu, Yuchuan Wu, Min Yang, Minzheng Wang, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, Yunshui Li, Xiaobo Xia, Fei Huang, Jingkuan Song, Yongbin Li

TL;DR

MMEvol tackles the data quality bottleneck in multimodal LLMs by applying Evol-Instruct to image-text data: an iterative instruction-evolution framework that combines fine-grained perception, cognitive reasoning, and interaction evolution to enrich a seed set, SEED-163K. Through a two-stage seed curation and a three-direction evolution cycle with explicit elimination of failed evolutions, it generates a higher-quality dataset that reaches state-of-the-art performance on nine of 13 vision-language tasks with significantly less data than competing state-of-the-art models. Models trained on the evolved data gain an average of 3.1 percentage points in accuracy over a baseline trained on the seed data, supporting the claim that carefully engineered, diverse instruction data can outperform larger quantities of lower-quality data. The work provides open evolutionary data and tooling, signaling a data-centric path forward for advancing open-source MLLMs.

Abstract

The development of Multimodal Large Language Models (MLLMs) has seen significant advancements with increasing demands in various fields (e.g., multimodal agents, embodied intelligence). While model-driven approaches attempt to enhance MLLM capabilities through diverse architectures, the gains have become increasingly marginal. Conversely, data-driven methods, which scale up image-text instruction data, are more effective but are constrained by limited data diversity and complexity. The absence of high-quality data constitutes a significant development barrier for MLLMs. To address the data quality bottleneck, we propose MMEvol, a novel multimodal instruction data evolution framework. This framework iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution, generating a more complex and diverse image-text instruction dataset that empowers MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we utilize MMEvol to systematically broaden the diversity of instruction types, extend visual reasoning steps to improve cognitive reasoning abilities, and thoroughly explore fine-grained information within images to enhance visual understanding and robustness. To comprehensively evaluate the effectiveness of our approach, we conduct extensive qualitative analysis and quantitative experiments across 13 vision-language tasks. Compared to baseline models trained with the initial seed data, the results demonstrate that our method achieves an average accuracy improvement of 3.1 percentage points. Furthermore, our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data than existing SOTA models.
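
The evolve-then-eliminate cycle described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the random choice of one direction per sample per round, and the fallback to the parent sample when an evolution fails are not taken from the paper, and the toy rewrite and check functions stand in for what would be calls to an MLLM.

```python
import random
from typing import Callable, List, Optional

# The three direction names follow the abstract; everything else is hypothetical.
DIRECTIONS = ("fine_grained_perception", "cognitive_reasoning", "interaction")


def evolve_dataset(
    seeds: List[str],
    rewrite: Callable[[str, str], str],
    is_improved: Callable[[str, str], bool],
    rounds: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Run several rounds of instruction evolution with elimination.

    rewrite(sample, direction) proposes an evolved instruction.
    is_improved(parent, candidate) decides whether the evolution succeeded.
    Assumption: a failed evolution is discarded and the parent is kept.
    """
    rng = rng or random.Random(0)
    current = list(seeds)
    for _ in range(rounds):
        next_round = []
        for sample in current:
            direction = rng.choice(DIRECTIONS)  # assumed: one direction per sample
            candidate = rewrite(sample, direction)
            # Elimination step: drop candidates that fail the check.
            next_round.append(candidate if is_improved(sample, candidate) else sample)
        current = next_round
    return current


# Toy stand-ins (hypothetical): a real pipeline would prompt an MLLM here.
def toy_rewrite(sample: str, direction: str) -> str:
    return f"{sample} [{direction}]"


def toy_check(parent: str, candidate: str) -> bool:
    # Accept only candidates that grew and carry at most two evolution tags.
    return len(candidate) > len(parent) and candidate.count("[") <= 2


evolved = evolve_dataset(["Describe the image."], toy_rewrite, toy_check, rounds=3)
print(evolved[0])
```

With the toy check, the third round's candidate is rejected because it would carry a third tag, so the sample from round two is kept. This mirrors, in miniature, how elimination stops failed evolutions from entering the dataset.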

Paper Structure (18 sections, 23 figures, 11 tables)

Figures (23)

  • Figure 1: Overview of MMEvol. Instruction evolution and instruction elimination synergistically collaborate through multiple rounds to enhance the diversity and complexity of instruction data.
  • Figure 2: SEED-163K: 163K Curated Seed Instruction Tuning Dataset for Evol-Instruct. Left: The inner circle shows the original distribution of SEED-163K. The outer circle shows the curated SEED-163K. Right: All the data sources in the SEED-163K dataset, as well as the ones filtered in data curation.
  • Figure 3: Prompt Head of MMEvol. The top block showcases the contexts such as caption and visual object locations, and the middle block demonstrates vision/language-centered atomic propositions and evolution objective (described later). Additionally, we endow vision capabilities with pseudo-function calls to enhance visual reasoning during evolutionary processes. Finally, the bottom block further elucidates the organized seed sample, which is subsequently sent to the MLLM for rewriting.
  • Figure 4: Fine-grained perceptual evolution prompt and example. Fine-grained perceptual evolution can generate samples with more detailed visual information, enhancing data diversity. The added details are marked with different colors for better visualization.
  • Figure 5: Cognitive reasoning evolution prompt template and example. Cognitive reasoning evolution can endow instruction data with a longer visual reasoning chain, increasing the complexity of the data. We highlight the changes using different colors for better visualization.
  • ...and 18 more figures