VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen
TL;DR
This work addresses the limited availability of reasoning-focused multimodal data by proposing VisualWebInstruct, a scalable pipeline that mines the web using 30k seed images to assemble ~1.04M consistency-verified QA pairs across multiple disciplines. The data are refined with GPT-4o and aligned with the source web content, so the dataset emphasizes high-quality, diverse visual and text-based reasoning prompts. Fine-tuning a 7B-parameter model on VisualWebInstruct produces MAmmoTH-VL2, which shows strong visual reasoning across seven benchmarks, achieving state-of-the-art results among open-source models and performance competitive with larger or proprietary systems. The dataset and model demonstrate the practical potential of web-scale multimodal data for advancing reasoning in vision-language models, and the authors plan to expand the data through additional search rounds.
Abstract
Vision-Language Models have made significant progress on many perception-focused tasks. However, their progress on reasoning-focused tasks remains limited due to the lack of high-quality and diverse training data. In this work, we aim to address the scarcity of reasoning-focused multimodal datasets. We propose VisualWebInstruct, a novel approach that leverages search engines to create a diverse and high-quality dataset spanning multiple disciplines, such as mathematics, physics, finance, and chemistry. Starting with a meticulously selected set of 30,000 seed images, we employ Google Image Search to identify websites containing similar images. We collect and process HTML data from over 700K unique URLs. Through a pipeline of content extraction, filtering, and synthesis, we construct a dataset of approximately 900K question-answer (QA) pairs, of which 40% are visual QA pairs and the remainder are text-based QA pairs. Models fine-tuned on VisualWebInstruct demonstrate significant performance improvements: (1) fine-tuning from Llava-OV results in an improvement of 10-20 absolute points across benchmarks, and (2) fine-tuning from MAmmoTH-VL yields a gain of 5 absolute points across benchmarks. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B parameter class on MMMU-Pro (40.7), MathVerse (42.6), and DynaMath (55.7). These results highlight the effectiveness of our dataset in enhancing the reasoning capabilities of vision-language models for complex multimodal tasks.
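To make the filtering stage concrete, the following is a minimal sketch of one way a consistency check over synthesized answers could work. It is an illustration under stated assumptions, not the authors' implementation: the text above does not specify how consistency is verified, so the majority-agreement rule, the 0.5 threshold, and the helper names `normalize`, `consistent_answer`, and `filter_pairs` are all hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude canonical form (assumption): lowercase, collapse whitespace, drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")


def consistent_answer(candidates, min_agreement=0.5):
    """Return the most common normalized answer if strictly more than
    `min_agreement` of the candidates agree on it; otherwise return None.

    `candidates` is a list of independently generated answer strings for
    one question. The threshold is a hypothetical choice for illustration.
    """
    if not candidates:
        return None
    counts = Counter(normalize(c) for c in candidates)
    answer, votes = counts.most_common(1)[0]
    return answer if votes / len(candidates) > min_agreement else None


def filter_pairs(pairs):
    """Keep only (question, answer) pairs whose candidate answers agree.

    `pairs` is an iterable of (question, list_of_candidate_answers).
    """
    kept = []
    for question, candidates in pairs:
        answer = consistent_answer(candidates)
        if answer is not None:
            kept.append((question, answer))
    return kept


if __name__ == "__main__":
    raw = [
        ("What is 6 * 7?", ["42", "42.", " 42 ", "41"]),  # 3 of 4 agree: kept
        ("Name the reagent.", ["NaOH", "HCl"]),           # no majority: dropped
    ]
    for question, answer in filter_pairs(raw):
        print(question, "->", answer)
```

In this sketch a question survives only when a strict majority of its candidate answers match after normalization. A real pipeline would likely need a more careful notion of answer equivalence, for example for numeric or symbolic answers.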
