MM-IFEngine: Towards Multimodal Instruction Following

Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang

TL;DR

The paper tackles the scarcity of high-quality multimodal instruction-following data and the limitations of existing IF benchmarks by introducing MM-IFEngine, a three-stage pipeline to generate image-instruction pairs, and MM-IFEval, a diverse, constraint-rich benchmark. It yields two datasets, MM-IFInstruct-23k for SFT and MM-IFDPO-23k for DPO, and demonstrates substantial IF performance gains when fine-tuning with these resources, while maintaining VQA capabilities. The work advances multimodal instruction following through a structured data-generation framework, a comprehensive constraint taxonomy, and a robust hybrid evaluation protocol, with open-sourced datasets and tools to foster further progress.

Abstract

The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction-following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data, MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and is extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and a judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2%), MIA (+7.6%), and IFEval (+12.3%). We have fully open-sourced the datasets (both SFT and DPO), evaluation code, and training scripts at https://github.com/SYuan03/MM-IFEngine.
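
To make the idea of a hybrid evaluation concrete, the following is a minimal Python sketch of how rule-based checks and a judge model could be combined into one score. It is not the authors' implementation: the helper names (`max_words`, `contains_keyword`, `bullet_count`, `score_response`), the specific rules, and the averaging scheme are all assumptions chosen for illustration, and the judge is a stand-in callable rather than a real model call.

```python
import re
from typing import Callable, List

# Hypothetical rule-based verifiers. Each factory returns a function that
# maps a model response to True (constraint satisfied) or False.
def max_words(limit: int) -> Callable[[str], bool]:
    # Whitespace-separated tokens are counted as words (an assumption).
    return lambda response: len(response.split()) <= limit

def contains_keyword(keyword: str) -> Callable[[str], bool]:
    # Case-insensitive substring match.
    return lambda response: keyword.lower() in response.lower()

def bullet_count(n: int) -> Callable[[str], bool]:
    # Counts lines that start with a '-', '*', or '•' bullet marker.
    pattern = re.compile(r"^\s*[-*•]\s+", re.MULTILINE)
    return lambda response: len(pattern.findall(response)) == n

def score_response(
    response: str,
    rule_checks: List[Callable[[str], bool]],
    judged_constraints: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of constraints satisfied.

    Constraints that can be verified exactly go through rule_checks;
    open-ended ones are passed as text to `judge`, which stands in for
    a judge-model call (assumed interface: (response, constraint) -> bool).
    """
    results = [check(response) for check in rule_checks]
    results += [judge(response, c) for c in judged_constraints]
    return sum(results) / len(results) if results else 0.0
```

A usage example under the same assumptions, with a stub judge that always accepts: calling `score_response("- cats are small\n- dogs are loyal", [max_words(10), contains_keyword("dogs"), bullet_count(2)], ["Use a friendly tone"], lambda r, c: True)` checks three exact constraints by rule and defers the tone constraint to the judge. Routing exact constraints to deterministic code avoids relying on a judge model for things such as counts and keywords, which is the imprecision the abstract attributes to existing evaluation strategies.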

Paper Structure

This paper contains 26 sections, 3 equations, 21 figures, 7 tables.

Figures (21)

  • Figure 1: Overall pipeline of MM-IFEngine. Part (a) demonstrates the three-stage workflow of our engine: (1) Image filter; (2) Task generation using GPT-4o for images without QA pairs and instruct refinement for existing annotations; and (3) Constraints integration incorporating 6 main categories and 32 subcategories, ensuring compatibility between constraints and tasks. MM-IFEngine is employed to generate SFT and DPO training datasets and MM-IFEval benchmark, as shown in part (b) and (c). MM-IFEval implements three evaluation metrics combining rule-based verification functions and a judge model to ensure accurate assessment.
  • Figure 2: Constraint Quantity Distribution in MM-IFInstruct-23k. Our MM-IFInstruct-23k exhibits systematic variation in constraint complexity, with each sample containing 3-12 constraints per instruction.
  • Figure 3: Constraint Category Distribution in Compose-Level Problems of MM-IFEval. This part comprises six primary constraint categories with 32 subcategories, forming a multi-level taxonomy for instruction-following evaluation.
  • Figure 4: Demonstration of constraint categories. We designed 6 main categories for all the constraints used, with a total of 32 subcategories.
  • Figure 5: Image Source Distribution in perception-level problems. The perception-level problems in MM-IFEval comprise 100 challenging vision-based instruction-following tasks, organized into 13 distinct classes according to image content characteristics and task complexity.
  • ...and 16 more figures