Learning from Reasoning Failures via Synthetic Data Generation

Gabriela Ben Melech Stan, Estelle Aflalo, Avinash Madasu, Vasudev Lal, Phillip Howard

TL;DR

This paper tackles data scarcity for vision-language models by proposing a failure-guided synthetic data generation framework that grounds data creation in the reasoning failures of a baseline LMM. A frontier multimodal model analyzes the baseline's errors and produces targeted image–text samples (QA pairs and images), which are filtered for quality before being used to instruction-tune large multimodal models. The authors assemble a dataset of $N=553{,}992$ synthetic samples across VizWiz, InfoVQA, ScienceQA, and OK-VQA and demonstrate substantial in-domain gains, often outperforming equal amounts of real-domain data, along with generalization to other backbones and tasks, including continued finetuning scenarios. The work shows that focusing synthetic data on specific failure modes yields better efficiency and broader applicability, and the planned public release of data and code is intended to support adoption in low-resource settings. Overall, the method provides a scalable, targeted path for teaching new multimodal reasoning skills to large language-vision models.

Abstract

Training models on synthetic data has emerged as an increasingly important strategy for improving the performance of generative AI. This approach is particularly helpful for large multimodal models (LMMs) due to the relative scarcity of high-quality paired image-text data compared to language-only data. While a variety of methods have been proposed for generating large multimodal datasets, they do not tailor the synthetic data to address specific deficiencies in the reasoning abilities of the LMMs that will be trained on the generated dataset. In contrast, humans often learn more efficiently by seeking out examples related to the types of reasoning where they have previously failed. Inspired by this observation, we propose a new approach for synthetic data generation that is grounded in the analysis of an existing LMM's reasoning failures. Our methodology leverages frontier models to automatically analyze errors produced by a weaker LMM and propose new examples that can be used to correct the reasoning failure via additional training; these examples are then filtered to ensure high quality. We generate a large multimodal instruction tuning dataset containing over 553k examples using our approach and conduct extensive experiments demonstrating its utility for improving the performance of LMMs on multiple downstream tasks. Our results show that models trained on our synthetic data can even exceed the performance of LMMs trained on an equivalent amount of additional real data, demonstrating the high value of generating synthetic data targeted to specific reasoning failure modes in LMMs. We will make our dataset and code publicly available.
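
The loop described in the abstract (find the weaker model's failures, have a stronger model propose related examples, then filter them) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Sample` fields, the function name `failure_guided_generation`, the three callables, and the exact-match correctness check are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One image-question-answer triple (field names are assumptions)."""
    image: str      # path or identifier of the image
    question: str
    answer: str


def failure_guided_generation(
    seed_data: Iterable[Sample],
    weak_answer: Callable[[Sample], str],
    propose: Callable[[Sample, str], List[Sample]],
    keep: Callable[[Sample], bool],
) -> List[Sample]:
    """Collect filtered synthetic samples derived from the weak model's errors.

    weak_answer: stands in for the weaker LMM answering a sample.
    propose:     stands in for the frontier model, which sees the failed
                 sample and the wrong prediction and returns new samples.
    keep:        stands in for the quality filter.
    """
    synthetic: List[Sample] = []
    for sample in seed_data:
        prediction = weak_answer(sample)
        # Exact match is a simplification; each benchmark has its own metric.
        if prediction.strip().lower() == sample.answer.strip().lower():
            continue  # only failures drive generation
        candidates = propose(sample, prediction)
        synthetic.extend(c for c in candidates if keep(c))
    return synthetic
```

In practice the three callables would wrap model calls; passing them in as plain functions keeps the sketch independent of any particular model API.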

Paper Structure

This paper contains 39 sections, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Illustration of our approach. Given a sample from an existing dataset which LLaVA answers incorrectly, we prompt a frontier model to analyze LLaVA's reasoning failures and propose new synthetic samples which require similar types of reasoning.
  • Figure 2: Comparison of fully synthetic similar and non-similar samples. Similar samples maintain a children's characters-based theme like the original sample, while non-similar samples address the failure modes by introducing diverse contexts.
  • Figure 3: Examples of generated synthetic question-answer pairs for real images from VizWiz, InfoVQA, and ScienceQA.
  • Figure 4: Prompt used to generate fully synthetic image-text samples based on the failure modes of an LMM (Method 2).
  • Figure 5: Filtering prompt.
  • ...and 10 more figures