Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

Letian Zhang, Quan Cui, Bingchen Zhao, Cheng Yang

TL;DR

Oasis tackles the scarcity of publicly available multimodal training data by proposing a simple, image-only data synthesis pipeline that generates high-quality, diverse instruction data. The method is a three-step process: (1) data synthesis with a hooking prompt that produces instructions from a single image, (2) data categorization that separates instruction-following data from captions, and (3) instruction quality control that checks solvability, clarity, and alignment with the image and discards low-quality data. Empirically, Oasis yields 500k synthetic examples (Oasis-500k) and delivers consistent gains across 14 benchmarks with multiple backbones, outperforming several existing synthesis approaches and scaling effectively up to 500k samples. The work demonstrates substantial domain adaptability (e.g., OCR) and provides a practical, scalable pathway for improving multimodal models without labor-intensive data curation, with code and data released for community use. The hooking and generation process is formalized as $\mathrm{Inst} = \Theta(\mathrm{vision})$ and $\mathrm{Resp} = \Theta(\mathrm{vision}, \mathrm{instruction})$, where $\Theta$ denotes the generating MLLM: the instruction is produced from the image alone, and the response from the image together with that instruction.
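The three steps above can be sketched as a small filtering loop. This is only an illustrative outline, not the authors' implementation: the `Sample` class, the `synthesize` function, and the four callables (`hook_generate`, `is_instruction`, `passes_quality`, `respond`) are hypothetical stand-ins for the MLLM and LLM calls, and generating the response only after quality control is an assumption made here to avoid spending calls on discarded instructions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One synthesized training example (hypothetical container)."""
    image_path: str
    instruction: str
    response: str


def synthesize(
    image_paths: Iterable[str],
    hook_generate: Callable[[str], str],         # stand-in for Inst = Theta(vision)
    is_instruction: Callable[[str], bool],       # stand-in for the Step 2 categorizer
    passes_quality: Callable[[str, str], bool],  # stand-in for the Step 3 checks
    respond: Callable[[str, str], str],          # stand-in for Resp = Theta(vision, instruction)
) -> List[Sample]:
    """Run the three-step outline over a collection of images."""
    samples: List[Sample] = []
    for path in image_paths:
        # Step 1: prompt the model with the image only and keep its text output.
        text = hook_generate(path).strip()
        # Step 2: drop outputs that are captions rather than instructions.
        if not text or not is_instruction(text):
            continue
        # Step 3: drop instructions that fail the quality checks
        # (solvability, clarity, alignment with the image).
        if not passes_quality(path, text):
            continue
        # Generate the response from the image and the surviving instruction.
        samples.append(Sample(path, text, respond(path, text)))
    return samples
```

In practice each callable would wrap a model call; keeping them as parameters lets the filtering logic be tested with simple stubs before any model is attached.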

Abstract

The success of multi-modal large language models (MLLMs) has been largely attributed to the large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns. The expensive and labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data with only images. Oasis breaks through traditional methods by prompting only images to the MLLMs, thus extending the data diversity by a large margin. Our method features a delicate quality control method which ensures the data quality. We collected over 500k data and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. The image-based synthesis also allows us to focus on the specific-domain ability of MLLMs. Code and dataset are publicly available at https://github.com/Letian2003/MM_INF.

Paper Structure

This paper contains 50 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Comparison of previous methods and our proposed Oasis framework for multi-modal data synthesis. Previous approaches rely on an input image, complex prompts for generating text, and human labor. Responses are generated from advanced LLMs and MLLMs (e.g., GPT-4V and Qwen2-VL). Interestingly, our proposed Oasis requires only a single image to generate multi-modal instruction-following data, showing great simplicity and practical value.
  • Figure 2: Detailed Oasis pipeline. This figure illustrates the full process of data synthesis with Oasis. The pipeline consists of three steps: data synthesis, data categorization, and instruction quality control. In Step 1, we break the traditional input tokens and entice a strong MLLM to generate instructions based on the image. In Step 2, an LLM filters out non-instruction-following data. In Step 3, a quality control mechanism ensures that the remaining instructions are reasonable and high-quality. Oasis offers a straightforward and efficient way to synthesize multi-modal training data with low demands (i.e., a single image input). The empirical results show that our method can significantly improve the performance of MLLMs.
  • Figure 3: Language type breakdown. The distribution of language types in Oasis data. English takes up the majority, while other languages are also well-represented. In total, 46 language types are included in the dataset.
  • Figure 4: Root verbs and top noun objects. The charts show the most common root verbs and their top 3 noun objects in LLaVA-NeXT and Oasis data. Word combinations in LLaVA-NeXT data are quite concentrated, e.g., "answer question" and "provide description". Conversely, words in Oasis data are more natural and representative.
  • Figure 5: Oasis synthetic data instances. We present several examples illustrating the robustness of our data synthesis approach across diverse and rare domains (e.g., Chart, Math, Code, and Science). Interestingly, the bottom-left figure shows that Oasis can generate complex math questions based on a chart image. Oasis identifies patterns and context within the table (e.g., daily spending amounts) and produces the related textual query ('Is Darnell correct about spending an average of $2 each day on lunch?').
  • ...and 3 more figures