Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou

TL;DR

Genixer addresses the rising cost of, and dependence on, GPT-4V for visual instruction tuning data by introducing a four-step pipeline that trains MLLMs to generate the data themselves. It supports two data-generation modes (task-agnostic and task-specific) and two backbones, Genixer$_{L}$ (LLaVA1.5) and Genixer$_{S}$ (Shikra), paired with two automatic data-filtering pipelines to produce Genixer-915K (VQA-like) and Genixer-350K (REC-like). Empirical results show that the synthetic data improves multiple multimodal benchmarks (e.g., VizWiz, ScienceQA, MME) and reduces hallucinations: LLaVA1.5 trained with the VQA-like data improves on 10 of 12 multimodal benchmarks, and Shikra trained with the REC-like data improves on 7 of 8 REC datasets. The work also provides analyses, human evaluations, and a user study, and releases code, models, and datasets at https://github.com/zhaohengyuan1/Genixer, supporting scalable, low-cost multimodal data generation.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but few research studies aim to gauge the ability to generate visual instruction tuning data. This paper proposes to explore the potential of empowering MLLMs to generate data independently without relying on GPT-4. We introduce Genixer, a comprehensive data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. Additionally, we outline two modes of data generation: task-agnostic and task-specific, enabling controllable output. We demonstrate that a synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks. Additionally, the grounding MLLM Shikra, when trained with a REC-like synthetic dataset, shows improvements on 7 out of 8 REC datasets. Through experiments and synthetic data analysis, our findings are: (1) current MLLMs can serve as robust data generators without assistance from GPT-4V; (2) MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data; (3) synthetic datasets enhance performance across various multimodal benchmarks and help mitigate model hallucinations. The data, code, and models can be found at https://github.com/zhaohengyuan1/Genixer.
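
The two generation modes and the filtering step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `TASKS` tuple, the `Sample` fields, the `yes_prob` scorer, and the 0.5 threshold are all hypothetical placeholders chosen for illustration. The only ideas taken from the text are that generation can run with or without a named task, and that generated samples are filtered automatically (the figure list mentions a "Yes" probability from a checking model).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical task names, borrowed from the figure captions for illustration.
TASKS = ("Common VQA", "Adv VQA", "MC VQA", "MD", "REC", "REG")


def build_prompt(task: Optional[str] = None) -> str:
    """Build a generation prompt (assumed wording, not the paper's template).

    task=None selects the task-agnostic mode: the generator picks the data type.
    A task name selects the task-specific mode: the output type is controlled.
    """
    base = "Generate a question and answer for this image."
    if task is None:
        return base
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    return f"This is a {task} task. {base}"


@dataclass
class Sample:
    image_id: str
    question: str
    answer: str


def filter_samples(
    samples: Iterable[Sample],
    yes_prob: Callable[[Sample], float],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[Sample]:
    """Keep samples whose checker-assigned "Yes" probability meets the threshold.

    yes_prob stands in for a call to a separate checking model; it is a
    caller-supplied function here so the sketch stays self-contained.
    """
    return [s for s in samples if yes_prob(s) >= threshold]
```

In use, `build_prompt()` would be paired with an image and sent to the trained generator, and `filter_samples` would be applied to its outputs with a real scoring function in place of the stub.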

Paper Structure (15 sections, 4 equations, 16 figures, 9 tables)

This paper contains 15 sections, 4 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Two unsatisfactory generation examples from GPT-4V. Our proposed data generator Genixer$_{S}$ is capable of generating complex multimodal data such as REC and REG data, whereas GPT-4V fails to generate the correct bounding box.
  • Figure 2: The illustration of our proposed automatic data generation pipeline Genixer.
  • Figure 2: Fuyu-8B evaluation results on the Flickr30K image dataset. Accuracy refers to the "Yes" prediction. Prob. represents the probability.
  • Figure 3: Selected examples generated from Genixer$_{L}$ and Genixer$_{S}$. The examples include Common VQA, Adv VQA, MC VQA, MD, and five grounding tasks.
  • Figure 4: A demonstration of two proposed instruction modes during the inference phase.
  • ...and 11 more figures