Simple o3: Towards Interleaved Vision-Language Reasoning

Ye Wang, Qianglong Chen, Zejun Li, Siyuan Wang, Shijie Guo, Zhirui Zhang, Zhongyu Wei

TL;DR

This work addresses the limited exploration of extended interleaved vision-language reasoning in multimodal models by introducing Simple o3, an end-to-end framework that combines dynamic visual tool interactions with iterative reasoning. It proposes a scalable data synthesis pipeline and the TWI-Tools-146K dataset, uses image masking to stabilize learning, and enables multi-step inference with tools such as focus_area, zoom_in, and reuse. Empirically, Simple o3 achieves substantial gains across multimodal reasoning and perception benchmarks, outperforming strong baselines and RL-based methods, while providing detailed ablations on tool choices and input resolution. The study offers practical guidance for tool selection and data composition in the thinking-with-images paradigm, and points to future directions in expanding tool sets and RL-based training for vision-language action tasks.

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive performance on vision-language tasks, but their long Chain-of-Thought (CoT) capabilities in multimodal scenarios remain underexplored. Inspired by OpenAI's o3 model, which emulates human-like "thinking with images" through iterative visual transformations and linguistic reasoning, we propose Simple o3, an end-to-end framework that integrates dynamic tool interactions (e.g., cropping, zooming, and reusing) into interleaved vision-language reasoning via supervised fine-tuning (SFT). Our approach features a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains via an "observe-reason-act" cycle, complete with executable visual operations and rigorous verification, yielding the open-source TWI-Tools-146K dataset. Experimental results show that Simple o3 outperforms existing approaches on diverse benchmarks. With these enhanced reasoning capabilities, Simple o3 establishes a powerful yet computationally affordable paradigm for advancing multimodal reasoning. We also provide the first in-depth analysis of different interleaved reasoning strategies, offering insights into their impact on model performance. We find that introducing additional visual tokens by reusing and magnifying the original image significantly improves the model's visual reasoning and fine-grained perception, while image cropping based on precise visual grounding allows the model to focus on key entities or regions, further enhancing its capabilities.
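
The "observe-reason-act" cycle with verification described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Step`, `synthesize_chain`, `generate`, `run_tool`, `verify_tool`, and `verify_answer` are hypothetical, the callables stand in for the MLLM, the toolbox, and the verifiers, and discarding a sample on any failed verification is an assumption the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Step:
    reasoning: str                    # reasoning content produced at this step
    tool_call: Optional[dict] = None  # tool instruction, e.g. {"name": "zoom_in"}
    answer: Optional[str] = None      # set only on the final step


def synthesize_chain(
    question: str,
    image: Any,
    generate: Callable[[str, List[Any], List[Step]], Step],
    run_tool: Callable[[dict, Any], Any],
    verify_tool: Callable[[Step, Any], bool],
    verify_answer: Callable[[str, str], bool],
    max_steps: int = 6,  # hypothetical cap on chain length
) -> Optional[List[Step]]:
    """Build one interleaved reasoning chain, or return None if it is rejected."""
    history: List[Step] = []
    images: List[Any] = [image]  # original image first, tool outputs appended
    for _ in range(max_steps):
        # Observe + reason: the generator sees the question, all images so far,
        # and the accepted history.
        step = generate(question, images, history)
        if step.answer is not None:
            # The chain ends with answer verification.
            if not verify_answer(question, step.answer):
                return None
            history.append(step)
            return history
        if step.tool_call is None:
            return None  # neither a tool call nor an answer: malformed step
        # Act: assumption here is that tools operate on the original image.
        result = run_tool(step.tool_call, images[0])
        # Assumption: a failed tool verification discards the whole sample.
        if not verify_tool(step, result):
            return None
        history.append(step)
        images.append(result)
    return None  # no answer within the step budget
```

Passing the model and verifiers as callables keeps the loop independent of any particular MLLM API, which is why no real library calls appear in the sketch.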

Paper Structure

This paper contains 19 sections, 6 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Overview of Simple o3. At Step 0, the blue text represents the atomic reasoning step $r_0$, while the green text represents the visual operation plan $p_0$. These two components constitute the reasoning content $R_0$. The pink text represents the tool instruction $C_0$, which is returned as a JSON object. During training, the $focus\_area$ operation is followed by the full image with the bbox drawn on it, so that the complete visual information of the image is retained. During inference, the image is cropped according to the coordinates returned by $focus\_area$ to inject visual tokens of the target entities or regions.
  • Figure 2: Overview of the scalable data generation pipeline. The MLLM generates the current reasoning content and tool commands at each step. The toolbox processes the input images based on these commands and returns the manipulated visuals, followed by tool verification to ensure semantic alignment between commands and visual operations. Upon successful verification, the system appends the current step's generation to the history for the next generation step. This cycle repeats until the answer is generated, concluding with answer verification to produce a complete data sample.
  • Figure 3: Prompt template for the reasoning path generator.
  • Figure 4: Tool definitions in OpenAI's dialogue format.
  • Figure 5: Tool verification prompt when executing $focus\_area$. The model determines whether the input image and the returned coordinates are semantically aligned with the visual operation planning.
  • ...and 8 more figures
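
The difference between training-time and inference-time handling of $focus\_area$ described in the Figure 1 caption can be illustrated with a small sketch. It is not the paper's code: the function names, the grayscale nested-list image representation, the half-open `(x1, y1, x2, y2)` pixel convention, and the one-pixel outline are all assumptions made so the example runs without an imaging library.

```python
from typing import List, Tuple

Image = List[List[int]]           # grayscale image as rows of pixel values (assumed)
BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2), half-open pixel coordinates (assumed)


def _clamp(bbox: BBox, image: Image) -> BBox:
    """Clip the box to the image bounds so slicing never goes out of range."""
    height = len(image)
    width = len(image[0]) if height else 0
    x1, y1, x2, y2 = bbox
    x1 = max(0, min(x1, width))
    x2 = max(x1, min(x2, width))
    y1 = max(0, min(y1, height))
    y2 = max(y1, min(y2, height))
    return x1, y1, x2, y2


def focus_area_inference(image: Image, bbox: BBox) -> Image:
    """Inference-style behaviour: return only the cropped region."""
    x1, y1, x2, y2 = _clamp(bbox, image)
    return [row[x1:x2] for row in image[y1:y2]]


def focus_area_training(image: Image, bbox: BBox, border: int = 255) -> Image:
    """Training-style behaviour: keep the full image and outline the box on a copy."""
    x1, y1, x2, y2 = _clamp(bbox, image)
    out = [row[:] for row in image]  # copy so the input image is not modified
    for y in range(y1, y2):
        for x in range(x1, x2):
            # Paint only pixels on the edge of the box, leaving the interior intact.
            if y in (y1, y2 - 1) or x in (x1, x2 - 1):
                out[y][x] = border
    return out
```

The crop yields a smaller image containing only the target region, whereas the outlined version keeps the original dimensions, which matches the caption's point that the training image preserves the complete visual information.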