Towards Robust Instruction Tuning on Multimodal Large Language Models

Wei Han; Hui Chen; Soujanya Poria

TL;DR

The paper tackles the high cost of generating large-scale multimodal instruction-following data by introducing InstrExp, an automatic instruction augmentation framework that starts from a small set of meta-instructions and expands an instruction-following dataset by up to about 30 times. It builds the MIns+ dataset through meta-prompt guided generation, Placeholder-Protected Generation, and rule-based postprocessing, and then fine-tunes multimodal LLMs on the augmented data. Empirical results across multiple base models (OFA, InstructBLIP, LLaVA) and benchmarks (MultiInstruct, InstructBLIP, MMMU) show that MIns+ consistently improves task alignment and zero-shot generalization, often matching or surpassing the gains from substantially larger manually curated datasets. The approach reduces the human labor needed to write instructions, and the paper also analyzes how task attributes and sampling strategies influence its effectiveness.

Abstract

Fine-tuning large language models (LLMs) on multi-task instruction-following data has proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent work on generating and selecting high-quality instruction-following data requires considerable human labor to conceive model-understandable instructions for the given tasks and to carefully filter the LLM-generated data. In this work, we introduce an automatic instruction augmentation method named INSTRAUG for multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG can significantly improve the alignment of multimodal large language models (MLLMs) across 12 multimodal tasks, a gain comparable to that of scaling up the training data several times.
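
The placeholder handling and rule-based postprocessing named in the TL;DR can be pictured roughly as follows. This is a minimal Python sketch and not the authors' implementation: the `{name}` placeholder syntax, the `<PH0>`-style sentinels, the helper names, and the toy rewriter standing in for an LLM call are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumed placeholder syntax for instruction templates, e.g. "{question}".
PLACEHOLDER = re.compile(r"\{[a-zA-Z_]+\}")


def protect(template: str) -> Tuple[str, Dict[str, str]]:
    """Replace each placeholder with an opaque sentinel before rewriting."""
    mapping: Dict[str, str] = {}

    def _sub(match: "re.Match[str]") -> str:
        sentinel = f"<PH{len(mapping)}>"
        mapping[sentinel] = match.group(0)
        return sentinel

    return PLACEHOLDER.sub(_sub, template), mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original placeholders back in place of the sentinels."""
    for sentinel, original in mapping.items():
        text = text.replace(sentinel, original)
    return text


def augment(template: str, rewrite: Callable[[str], List[str]]) -> List[str]:
    """Rewrite a masked template and keep only candidates that preserve
    every sentinel exactly once (a simple rule-based filter)."""
    masked, mapping = protect(template)
    kept = []
    for candidate in rewrite(masked):
        if all(candidate.count(sentinel) == 1 for sentinel in mapping):
            kept.append(restore(candidate.strip(), mapping))
    return kept


def toy_rewrite(masked: str) -> List[str]:
    """Hypothetical stand-in for an LLM that proposes rewritten instructions."""
    return [
        masked.replace("Answer", "Please answer"),
        "Respond to the question shown in the picture.",  # drops the sentinels
    ]


if __name__ == "__main__":
    for line in augment("Answer {question} based on {image}.", toy_rewrite):
        print(line)
```

In this sketch the second candidate is discarded because it no longer contains the sentinels, so it could not be filled with an instance's question and image at training time.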

Paper Structure (52 sections, 7 equations, 7 figures, 15 tables, 1 algorithm)

Introduction
Related Work
Instruction Finetuning on MLLMs
Automatic Instruction Generation
Method
Multimodal Instruction Fine-tuning (MIFT)
Generation Process
Meta-Prompts
Handling Placeholders
Postprocessing
Fine-tuning on MIns+
Source Dataset
Dataset Construction
Experiments
Experiment Setup
...and 37 more sections

Figures (7)

Figure 1: Zero-shot performance on MultiInstruct test set (9 tasks) by OFA tuned on each instruction-following dataset. By expanding the instruction set several times using automatically generated instructions ("MINS,59K" to "MINS+,59K"), the average score is close to that tuned using 10x more data ("MINS+,59K" compared to "MINS,564K", highlighted by the arrow).
Figure 2: The overall framework of InstrExp. The self-loop only iterates once when we roll out instructions from the meta-prompt (e.g., "generate 10 instruction about ...").
Figure 3: Average performance gain of three versions of MIns+ on OFA-Large vs. instance text proportion by task ($\rho=0.639$).
Figure 4: Results on MIns+ (59K) of different $\epsilon$ values for both models. $\epsilon=1.0$ is the result on MIns.
Figure 5: Distributions of instruction lengths of the original and generated datasets. The horizontal axis represents the instruction length and the vertical axis is the frequency in that dataset.
...and 2 more figures
