Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Letian Zhang, Quan Cui, Bingchen Zhao, Cheng Yang
TL;DR
Oasis tackles the scarcity of publicly available multimodal training data with a simple, image-only synthesis pipeline that generates high-quality, diverse instruction data. The method has three steps: (1) data synthesis, where a hooking prompt containing only an image elicits an instruction from the MLLM; (2) data categorization, which separates instruction-following data from captions; and (3) instruction quality control, which checks each instruction for solvability, clarity, and alignment with the image and discards low-quality data. Empirically, Oasis yields 500k synthetic examples (Oasis-500k) and delivers consistent gains across 14 benchmarks with multiple backbones, outperforming several existing synthesis approaches and scaling effectively up to 500k examples. The work also demonstrates domain adaptability (e.g., OCR) and offers a practical, scalable way to improve multimodal models without labor-intensive data curation, with code and data released for community use. The hooking and generation process is formalized as $\mathrm{Inst} = \Theta(\mathrm{vision})$ and $\mathrm{Resp} = \Theta(\mathrm{vision}, \mathrm{instruction})$.
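The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it reads $\Theta$ as the MLLM, and the names `synthesize`, `Sample`, `toy_mllm`, `toy_is_instruction`, and `toy_quality` are hypothetical stand-ins. The point at which the response is generated relative to filtering is also an assumption; the summary only fixes the order of the three steps.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical interface: an MLLM is any callable that takes an image
# reference and an optional text prompt, and returns generated text.
MLLM = Callable[[str, Optional[str]], str]


@dataclass
class Sample:
    image: str
    instruction: str
    response: str


def synthesize(
    images: Iterable[str],
    generate: MLLM,
    is_instruction: Callable[[str], bool],
    passes_quality: Callable[[str, str], bool],
) -> List[Sample]:
    """Sketch of an image-only synthesis loop with two filters."""
    samples = []
    for image in images:
        # Step 1: hook the model with the image only, Inst = Theta(vision).
        text = generate(image, None)
        # Step 2: categorization, keep instructions and drop captions.
        if not is_instruction(text):
            continue
        # Step 3: quality control (solvability, clarity, image alignment).
        if not passes_quality(image, text):
            continue
        # Response generation, Resp = Theta(vision, instruction).
        # Generating it only for surviving instructions is an assumption.
        response = generate(image, text)
        samples.append(Sample(image, text, response))
    return samples


# Toy stand-ins so the sketch runs without a real model.
def toy_mllm(image: str, prompt: Optional[str] = None) -> str:
    if prompt is None:
        hooked = {
            "a.jpg": "What color is the car?",
            "b.jpg": "A dog sits on grass.",  # caption, not an instruction
            "c.jpg": "Do it?",                # too vague to be solvable
        }
        return hooked[image]
    return f"answer to: {prompt}"


def toy_is_instruction(text: str) -> bool:
    # Crude placeholder for the categorization step.
    return text.rstrip().endswith("?")


def toy_quality(image: str, instruction: str) -> bool:
    # Crude placeholder for the quality check.
    return len(instruction.split()) >= 3


data = synthesize(["a.jpg", "b.jpg", "c.jpg"], toy_mllm, toy_is_instruction, toy_quality)
```

In practice the two filter callables would themselves be model calls with scoring prompts; the toy versions here only show where each filter sits in the loop.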
Abstract
The success of multi-modal large language models (MLLMs) has been largely attributed to large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns, and the expensive, labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data from images alone. Oasis departs from traditional methods by prompting MLLMs with only images, which greatly extends data diversity. Our method features a careful quality control procedure that ensures data quality. We collected over 500k data samples and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. The image-based synthesis also allows us to target domain-specific abilities of MLLMs. Code and dataset are publicly available at https://github.com/Letian2003/MM_INF.
