ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

Krishna Teja Chitty-Venkata, Murali Emani

TL;DR

ImageNet-Think-250K delivers the largest publicly available multimodal reasoning dataset with explicit thinking traces, built from 250K ImageNet-21k images and 500K thinking-answer sequences generated by two state-of-the-art VLMs. By providing reasoning traces separate from final answers, the work enables researchers to train and evaluate models on both reasoning quality and final output, and it benchmarks five open-source baselines across a comprehensive metric suite. The dataset addresses prior gaps in scale, domain coverage, and reasoning transparency, and its multi-model annotation strategy offers richer diversity in reasoning patterns. This resource enables clearer diagnosis of model failures, supports interpretable VLM development, and lays the groundwork for future research in hierarchical and multimodal reasoning architectures with broader applicability across vision-language tasks.

Abstract

We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from the ImageNet-21k dataset and provides structured thinking tokens with corresponding answers. The synthetic annotations are generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two thinking-answer pairs, one from each model, creating a resource for training and evaluating multimodal reasoning models. We capture both the step-by-step reasoning process of the VLMs and their final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be made publicly available to aid research on reasoning/thinking multimodal VLMs.

Paper Structure

This paper contains 28 sections, 1 equation, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Organization of our ImageNet-Think-250K dataset. Each sample consists of an input question ("Please analyze this image step by step..."), followed by multiple rounds of thinking tokens (Think 1, Think 2, ...), where the model produces intermediate reasoning steps describing objects, context, and relationships. These are then paired with corresponding answer tokens (Answer 1, Answer 2, ...), which provide refined explanations or final interpretations of the image. The figure illustrates two examples: (a) a traditional stone-milling setup, and (b) a large reptile (turtle/tortoise). This organization highlights how our dataset explicitly separates reasoning traces from final answers, enabling the evaluation of both reasoning quality and outcome accuracy.
  • Figure 2: An example of a sample image and reasoning instance from the ImageNet-Think dataset. Each instance consists of a Question that prompts the model to analyze the image step by step and provide detailed reasoning. The Think section illustrates the reasoning trace, where the model generates step-by-step inferences, observations, and contextual details about the scene; these intermediate outputs are referred to as thinking tokens. Following this, the dataset records multiple Answer fields that represent the final summarized outputs, which condense the reasoning into concise descriptions or conclusions. This structure captures both the process (thinking tokens) and the outcome (final answers), enabling explicit evaluation of reasoning quality in addition to correctness of the end prediction. The example shown demonstrates how models explain visual details (attire, equipment, target setting, and actions) before producing coherent, task-relevant answers. Such instances highlight the dataset’s ability to disentangle step-by-step reasoning from final responses, making it valuable for studying reasoning quality, interpretability, and multimodal chain-of-thought behaviors.
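
The sample organization described in the two captions (one question, then a thinking trace and a final answer from each annotating model) could be represented roughly as below. This is a minimal sketch, not the dataset's published schema: the class names `ReasoningTrace` and `Sample`, all field names, the `image_id` value, and the `<think>...</think>` delimiter used by `split_output` are assumptions made for illustration, and the actual release may store or tag the traces differently.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReasoningTrace:
    """One thinking-answer pair produced by a single annotating model (hypothetical layout)."""
    model: str      # e.g. "GLM-4.1V-9B-Thinking"
    thinking: str   # step-by-step reasoning trace
    answer: str     # final descriptive answer


@dataclass
class Sample:
    """One image with its question and the traces from each model (hypothetical layout)."""
    image_id: str
    question: str
    traces: List[ReasoningTrace] = field(default_factory=list)


# Assumed delimiter: the raw model output wraps its reasoning in <think>...</think>.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_output(raw: str) -> Tuple[str, str]:
    """Split a raw model output into (thinking, answer).

    If no <think> block is found, the whole text is treated as the answer.
    """
    match = _THINK_RE.search(raw)
    if match is None:
        return "", raw.strip()
    thinking = match.group(1).strip()
    answer = raw[match.end():].strip()
    return thinking, answer


# Usage: build one sample from a made-up raw output string.
raw = "<think>A round stone sits on a wooden base.</think> A traditional stone mill."
thinking, answer = split_output(raw)
sample = Sample(
    image_id="example_0001",  # placeholder identifier, not a real dataset key
    question="Please analyze this image step by step...",
    traces=[ReasoningTrace("GLM-4.1V-9B-Thinking", thinking, answer)],
)
```

Keeping `thinking` and `answer` as separate fields mirrors the separation the captions emphasize, so reasoning quality and final-answer quality can be scored independently.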