VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search

Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen

TL;DR

This work addresses the limited availability of reasoning-focused multimodal data by proposing VisualWebInstruct, a scalable pipeline that mines the web using 30k seed images to assemble ~1.04M consistency-verified QA pairs across multiple disciplines. By refining the data with GPT-4o and aligning it with the original web content, the dataset emphasizes high-quality, diverse visual and text-based reasoning prompts. Fine-tuning a 7B-parameter model (MAmmoTH-VL2) on VisualWebInstruct yields strong visual reasoning across seven benchmarks, achieving state-of-the-art results among open-source models and competitive performance with larger or proprietary systems. The dataset and model demonstrate the practical potential of web-scale multimodal data for advancing reasoning in vision-language models, with plans to expand the data via additional search rounds.

Abstract

Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines, including mathematics, physics, finance, and chemistry. Starting with a meticulously selected set of 30,000 seed images, we employ Google Image Search to identify websites containing similar images. We collect and process HTML data from over 700K unique URLs. Through a pipeline of content extraction, filtering, and synthesis, we construct a dataset of approximately 900K question-answer (QA) pairs, with 40% consisting of visual QA pairs and the remainder comprising text-based QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance improvements: (1) fine-tuning from Llava-OV yields an improvement of 10-20 absolute points across benchmarks, and (2) fine-tuning from MAmmoTH-VL yields a gain of 5 absolute points across benchmarks. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B parameter class on MMMU-Pro (40.7), MathVerse (42.6), and DynaMath (55.7). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.

Paper Structure

This paper contains 25 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of our automated data curation approach and major experimental results.
  • Figure 2: Comprehensive Pipeline for VisualWebInstruct Dataset Generation. The workflow illustrates our multi-stage approach for creating high-quality multimodal instruction data. Stage 1: starting with seed images, we leverage Google Image search to identify relevant webpages, which are processed into accessibility trees. The raw QA pairs are extracted from the trees and refined through a post-processing step to ensure the validity of the data. Stage 2: we first generate multiple synthesized answers for consistency filtering, then align these with original web-sourced content to enhance the accuracy of the answers.
  • Figure 3: Example of Google Lens search functionality for circle geometry problems.
  • Figure 4: Example of an accessibility tree structure extracted from an educational website.
  • Figure 5: Illustration of our consistency checking methodology.
  • ...and 2 more figures
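
The consistency filtering step mentioned in the Figure 2 caption (generate several synthesized answers per question, keep the question only when the answers agree) can be illustrated with a minimal sketch. This is not the authors' implementation: the string normalization, the majority-vote rule, the `min_agreement` threshold, and the example QA pairs are all assumptions made for illustration, and a real pipeline would likely need a far more robust notion of answer equivalence than string matching.

```python
from collections import Counter
from typing import Optional


def normalize(answer: str) -> str:
    """Crude canonical form: lowercase, collapse whitespace, drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")


def consistency_filter(candidates: list[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the majority answer if enough candidates agree, otherwise None.

    `min_agreement` is a hypothetical threshold, not a value from the paper.
    """
    if not candidates:
        return None
    counts = Counter(normalize(c) for c in candidates)
    answer, votes = counts.most_common(1)[0]
    # Keep the answer only if strictly more than `min_agreement` of candidates match it.
    return answer if votes / len(candidates) > min_agreement else None


# Hypothetical questions, each with four synthesized candidate answers.
qa_pairs = {
    "What is the area of a circle with radius 2?": ["4π", "4π", "4π.", "2π"],
    "What is the slope of the line?": ["1/2", "2", "-1", "0.5"],
}

kept = {}
for question, answers in qa_pairs.items():
    agreed = consistency_filter(answers)
    if agreed is not None:
        kept[question] = agreed

print(kept)
```

In this sketch the first question is kept because three of its four candidates normalize to the same string, while the second is discarded because no candidate reaches a majority. Note that "1/2" and "0.5" are treated as different answers here, which shows why plain string matching is only a stand-in for a proper equivalence check.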