MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation

Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman

TL;DR

MM-Gen tackles the gap where vision-language models underperform on specialized tasks due to generic training data. It introduces a scalable three-stage pipeline that partitions data by image type, generates task-focused annotations from a stronger VLM guided by a small reference set, and filters the synthetic data by perplexity to retain high-signal samples. Empirically, MM-Gen delivers substantial gains across chart understanding, diagram understanding, and spatial reasoning for Llava-1.5 variants, outperforming task-agnostic captions and approaching or surpassing human-curated baselines while using far less human effort. The work demonstrates that automated, task-centric data enrichment can bridge the gap between broad image-caption data and niche downstream requirements, with implications for curriculum design and ensemble strategies in future research.

Abstract

Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at https://github.com/sjoshi804/MM-Gen.
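
The first two stages described above (partitioning candidate images into subgroups, then generating targeted text per subgroup) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the record fields (`id`, `type`), the `references` mapping, and the `stub_annotate` callable standing in for a stronger VLM are all hypothetical names introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def partition_by_type(images: List[dict]) -> Dict[str, List[dict]]:
    """Group candidate images into subgroups keyed by an (assumed) 'type' field."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for img in images:
        groups[img["type"]].append(img)
    return dict(groups)


def generate_annotations(
    groups: Dict[str, List[dict]],
    references: Dict[str, List[str]],
    annotate: Callable[[dict, List[str]], str],
) -> List[dict]:
    """Call the annotator once per image, passing that subgroup's reference samples."""
    samples = []
    for image_type, imgs in groups.items():
        refs = references.get(image_type, [])
        for img in imgs:
            samples.append(
                {"image_id": img["id"], "type": image_type, "text": annotate(img, refs)}
            )
    return samples


def stub_annotate(img: dict, refs: List[str]) -> str:
    """Placeholder for a call to a stronger VLM (hypothetical; no real model is used)."""
    return f"{len(refs)} reference(s) used for image {img['id']}"


images = [
    {"id": "a", "type": "chart"},
    {"id": "b", "type": "diagram"},
    {"id": "c", "type": "chart"},
]
references = {"chart": ["What is the highest bar?"], "diagram": []}

groups = partition_by_type(images)
samples = generate_annotations(groups, references, stub_annotate)
```

Passing the annotator in as a callable keeps the sketch independent of any particular model API; in practice it would wrap whatever stronger VLM is available.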

Paper Structure (17 sections, 1 equation, 12 figures, 4 tables, 1 algorithm)

This paper contains 17 sections, 1 equation, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Examples of general text captions vs. task-specific text annotations generated by MM-Gen and used for fine-tuning supervision.
  • Figure 2: Even high-quality human-curated captions (MS COCO) miss many details found in images.
  • Figure 3: Examples of how text perplexity maps to easy cases (low perplexity), potential noise and difficulty outliers (highest perplexity), and meaningful, non-trivial questions (middle perplexity). Middle-perplexity questions are also more likely to add new and useful training signal.
  • Figure 4: Comparing different baselines for multimodal data generation with MM-Gen. MM-Gen not only customizes the generated text to the task via reference samples, but it also adds missing details to the text that are required for answering the task.
  • Figure 5: Performance of MM-Gen across tasks, compared against the contributed baselines and the skyline.
  • ...and 7 more figures
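
The perplexity-based filtering idea in Figure 3 (drop the easiest and the noisiest samples, keep the middle) might look something like the sketch below. It assumes per-token log-probabilities are already available from some scoring model. The quantile cutoffs `low_q` and `high_q` are illustrative assumptions, not values taken from the paper.

```python
import math
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_middle_band(samples: List[dict], low_q: float = 0.2, high_q: float = 0.8) -> List[dict]:
    """Keep samples whose perplexity rank falls between two (assumed) quantile cutoffs."""
    ranked = sorted(samples, key=lambda s: s["ppl"])
    lo = int(len(ranked) * low_q)
    hi = int(len(ranked) * high_q)
    return ranked[lo:hi]


# Toy log-probabilities; in practice these would come from a scoring model.
raw = {
    "easy": [-0.1, -0.1, -0.1],
    "mild": [-0.5, -0.6, -0.4],
    "medium": [-1.0, -1.2, -0.9],
    "hard": [-2.0, -1.8, -2.2],
    "noisy": [-6.0, -7.0, -5.5],
}
scored = [{"name": name, "ppl": perplexity(lps)} for name, lps in raw.items()]
kept = keep_middle_band(scored)
kept_names = [s["name"] for s in kept]
```

Rank-based cutoffs are used here only because they need no absolute threshold; a fixed perplexity range would be an equally plausible design.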