ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models
Krishna Teja Chitty-Venkata, Murali Emani
TL;DR
ImageNet-Think-250K is presented as the largest publicly available multimodal reasoning dataset with explicit thinking traces. It is built from 250K ImageNet-21k images and 500K thinking-answer sequences generated by two state-of-the-art VLMs. Because reasoning traces are stored separately from final answers, researchers can train and evaluate models on both reasoning quality and final output. The work also benchmarks five open-source baselines across a suite of metrics. The dataset addresses prior gaps in scale, domain coverage, and reasoning transparency, and its multi-model annotation strategy yields more diverse reasoning patterns than a single annotator model would. The resource supports clearer diagnosis of model failures and the development of interpretable VLMs, and it lays groundwork for future research on hierarchical and multimodal reasoning architectures across vision-language tasks.
Abstract
We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from the ImageNet-21k dataset and provides structured thinking tokens with corresponding answers. The synthetic annotations are generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two thinking-answer pairs, creating a resource for training and evaluating multimodal reasoning models. We capture both the step-by-step reasoning process of the VLMs and their final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be made publicly available to aid research on reasoning (thinking) VLMs.
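To make the record layout concrete, the following is a minimal sketch of how one image with its two thinking-answer pairs might be represented. The abstract does not specify the released schema, so the class names, field names, and the `make_record` and `total_sequences` helpers are hypothetical. The sketch also assumes that each of the two generator models contributes exactly one pair per image.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# The two generator models named in the abstract.
GENERATORS = ("GLM-4.1V-9B-Thinking", "Kimi-VL-A3B-Thinking-2506")


@dataclass(frozen=True)
class ThinkingAnswer:
    """One thinking-answer pair (hypothetical field names)."""
    model: str     # VLM that produced this pair
    thinking: str  # step-by-step reasoning trace
    answer: str    # final descriptive answer, stored separately


@dataclass(frozen=True)
class Record:
    """One image with its annotations (hypothetical layout)."""
    image_id: str
    annotations: Tuple[ThinkingAnswer, ...]


def make_record(image_id: str, outputs: Dict[str, Tuple[str, str]]) -> Record:
    """Build a record from {model_name: (thinking, answer)}.

    Assumption: exactly one pair per generator model, two pairs in total.
    """
    if set(outputs) != set(GENERATORS):
        raise ValueError("expected one (thinking, answer) pair per generator model")
    pairs = tuple(
        ThinkingAnswer(model=m, thinking=outputs[m][0], answer=outputs[m][1])
        for m in GENERATORS
    )
    return Record(image_id=image_id, annotations=pairs)


def total_sequences(n_images: int, pairs_per_image: int = 2) -> int:
    """Number of thinking-answer sequences for a given image count."""
    return n_images * pairs_per_image


# Placeholder content only; not taken from the dataset.
record = make_record(
    "example_0001",
    {
        "GLM-4.1V-9B-Thinking": ("placeholder reasoning A", "placeholder answer A"),
        "Kimi-VL-A3B-Thinking-2506": ("placeholder reasoning B", "placeholder answer B"),
    },
)
```

Under this layout, 250,000 images with two pairs each give the 500K thinking-answer sequences quoted in the TL;DR. Keeping `thinking` and `answer` as separate fields is what allows reasoning quality and final output to be scored independently.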
