MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

Jihao Liu; Xin Huang; Jinliang Zheng; Boxiao Liu; Jia Wang; Osamu Yoshie; Yu Liu; Hongsheng Li

TL;DR

MM-Instruct tackles the scarcity of diverse visual instruction data by transforming large caption corpora into rich instruction–answer pairs using large language models, grounding generation in detailed text descriptions of the images to keep answers faithful. The pipeline uses seed-instruction augmentation, CLIP-based image–instruction matching, and a two-stage generation process built on an open-source LLM to produce 234k high-quality V-I-A triplets across roughly 300 instructions. A new GPT-4V-judged benchmark evaluates instruction-following, and training a LLaVA-1.5 model on this data (the resulting model is denoted LLaVA-Instruct) yields substantial gains on 12 vision-language benchmarks as well as improved instruction-following performance. The work demonstrates that open-source LLMs, combined with grounding and filtering, can scale the alignment of multimodal models to real-world user tasks. The data, benchmark, and pre-trained models are publicly released.

Abstract

This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). Existing visual instruction datasets often focus on question-answering, so models trained on them struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first uses ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmentation and summarization. It then matches these instructions with images and uses an open-source large language model (LLM) to generate coherent answers to the instruction-image pairs. Throughout answer generation, the LLM is grounded by detailed text descriptions of the images to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
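
As a rough illustration of the grounding idea in the abstract, the sketch below assembles a text-only prompt in which the LLM sees the image's text description together with the matched instruction. The function name `build_grounded_prompt` and the prompt wording are hypothetical, chosen for illustration only; the paper's actual prompt template is not given in this section.

```python
def build_grounded_prompt(description: str, instruction: str) -> str:
    """Combine an image's text description with an instruction.

    Hypothetical template: the wording below is an assumption, not the
    prompt used by the MM-Instruct authors.
    """
    return (
        "Image description:\n"
        f"{description.strip()}\n\n"
        "Instruction:\n"
        f"{instruction.strip()}\n\n"
        "Answer the instruction using only details stated in the description."
    )


if __name__ == "__main__":
    # Toy inputs, invented for this example.
    prompt = build_grounded_prompt(
        "A golden retriever runs along a beach at sunset.",
        "Write a short social media post about this image.",
    )
    print(prompt)
```

Because the text-only LLM never sees pixels, the description is its only evidence about the image, which is why the prompt places it before the instruction.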

Paper Structure (16 sections, 1 equation, 10 figures, 1 table)

Introduction
Methods
Instruction Generation
Instance Generation
Data Filtering
LLaVA-Instruct
Benchmark
Experiments
Experimental Setups
Performance on Vision-Language Benchmarks
Evaluation of Instruction-Following Capability
Data Diversity and Data Quality
Ablation Studies
Qualitative Results
Related Works
...and 1 more sections

Figures (10)

Figure 1: Example of instruction-following capability. For the given instruction, our baseline model (in green) follows the instruction and generates a post with engaging emojis and hashtags. In contrast, LLaVA-1.5's response describes a narrative instead of composing a post and has factual errors. This demonstrates our method is better able to comprehend and fulfill the intent of instructions.
Figure 2: MM-Instruct for automatic instruction data generation. (Top) In the instruction generation phase, ChatGPT is tasked with coming up with new instructions based on the image's text description. The generated instructions are then clustered and summarized into final instructions. (Bottom) In the instance generation phase, we first utilize CLIP to select a proper instruction for the input image and then employ Mixtral-8x7b to generate the answer adhering to the selected instruction.
Figure 3: Illustration of instruction generation with in-context examples. The text description is generated by an off-the-shelf LMM. The in-context examples are randomly sampled from 43 manually crafted seed instructions. We prompt ChatGPT to come up with a new instruction based on the text description and in-context examples.
Figure 4: Example of image-instruction matching. We show the top 5 instructions that match the example image, along with their corresponding scores.
Figure 5: Instruction-following evaluation using GPT-4V as the judge. We compare LLaVA-Instruct-7B/13B to 5 different approaches. Our baseline models demonstrate stronger instruction-following capabilities than InstructBLIP or LLaVA under the same model sizes.
...and 5 more figures
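
The image-instruction matching described in the captions of Figures 2 and 4 can be sketched as a ranking by embedding similarity. The example below is a minimal sketch that assumes image and instruction embeddings have already been computed by CLIP; the two-dimensional vectors and instruction strings are made-up toy values, and `top_k_instructions` is a hypothetical helper, not a function from the released code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_instructions(image_emb, instruction_embs, k=5):
    """Return the k (instruction, score) pairs most similar to the image."""
    scored = [
        (text, cosine(image_emb, emb))
        for text, emb in instruction_embs.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings standing in for CLIP outputs (assumption).
    image_emb = [1.0, 0.0]
    instruction_embs = {
        "Write a social media post about the image.": [1.0, 0.0],
        "Summarize the scene in one sentence.": [1.0, 1.0],
        "Analyze the chart shown in the image.": [0.0, 1.0],
    }
    for text, score in top_k_instructions(image_emb, instruction_embs, k=2):
        print(f"{score:.3f}  {text}")
```

With real CLIP features the same ranking applies unchanged; only the embedding vectors would come from the model's image and text encoders instead of being hard-coded.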
