MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, Hongsheng Li
TL;DR
MM-Instruct tackles the scarcity of diverse visual instruction data by transforming large caption corpora into rich instruction–answer pairs using large language models, with generation grounded in detailed image descriptions to keep answers faithful to the image. The pipeline uses seed-instruction augmentation, CLIP-based image–instruction matching, and a two-stage generation process with an open-source LLM to produce 234k high-quality V-I-A triplets across roughly 300 instructions. A new GPT-4V-judged benchmark evaluates instruction following, and training LLaVA-1.5 on this data (the resulting model is called LLaVA-Instruct) yields substantial gains on 12 vision-language benchmarks as well as improved instruction-following performance. This work demonstrates that open-source LLMs, combined with grounding and filtering, can scale the alignment of multimodal models to real-world user tasks. The data, benchmark, and pretrained models are publicly released.
Abstract
This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). Existing visual instruction datasets often focus on question answering, so models trained on them struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing large language models (LLMs) to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first uses ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-source LLM to generate coherent answers to the instruction-image pairs. Throughout answer generation, the LLM is grounded in detailed text descriptions of the images so that the instruction data stays aligned with the image content. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities over the original LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
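The matching and grounding steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine`, `match_instructions`, `build_prompt`), the toy three-dimensional vectors standing in for CLIP embeddings, and the prompt template are all assumptions made for illustration. The sketch ranks candidate instructions by cosine similarity to an image embedding, then builds a text-only prompt in which the image is represented by its detailed description.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_instructions(
    image_emb: Sequence[float],
    instruction_embs: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[str]:
    """Return the top_k instructions whose embeddings are closest to the image embedding."""
    ranked = sorted(
        instruction_embs,
        key=lambda text: cosine(image_emb, instruction_embs[text]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(instruction: str, description: str) -> str:
    """Hypothetical grounding template: the LLM sees the description, not the pixels."""
    return (
        f"Image description: {description}\n"
        f"Instruction: {instruction}\n"
        "Answer:"
    )


# Toy embeddings; in practice these would come from an image-text encoder such as CLIP.
image_embedding = [0.9, 0.1, 0.0]
instruction_embeddings = {
    "Write a short poem about this scene.": [1.0, 0.0, 0.0],
    "Summarize the chart in the image.": [0.0, 1.0, 0.0],
    "Tell a story inspired by this picture.": [0.7, 0.7, 0.0],
}

best = match_instructions(image_embedding, instruction_embeddings, top_k=1)[0]
prompt = build_prompt(best, "A quiet lake at sunrise with mist over the water.")
print(prompt)
```

The resulting prompt would be passed to a text-only LLM, which is why the quality of the image description matters for keeping answers faithful to the image.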
