Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin

TL;DR

This work identifies pronounced brittleness in state-of-the-art multimodal LLMs under common image perturbations, such as simple orientation changes and natural corruptions, and proposes CodeVision, a code-as-tool framework in which the model generates code to call arbitrary image operations. The authors use a two-stage training regime (SFT on a curated multi-turn dataset, followed by RL with a dense process reward) to teach strategic tool use, and they observe emergent tool chaining and error recovery. They introduce a benchmark suite targeting orientation robustness and multi-tool reasoning, spanning OCRBench, ChartQAPro, and MVToolBench, and report significant gains over strong baselines, including improved orientation handling and stronger multi-tool reasoning. The approach offers a scalable path toward visual agents that treat reasoning as programmable interaction with images.

Abstract

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel, dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on the Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
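
To make the code-as-tool idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the image is a plain 2D list of pixel values and that each model turn is a Python snippet that reads `image` and assigns `result`; the names `run_generated_code` and `solve_with_retries` are hypothetical. It shows how a runtime error can be returned as feedback so that a later attempt can recover.

```python
def run_generated_code(code, image):
    """Execute a model-written snippet with `image` in scope.

    Returns (ok, payload): the transformed image on success, or a short
    error message that would be fed back to the model on failure.
    NOTE: exec() is unsafe on untrusted code; a real system would need a
    sandbox. This is an illustrative assumption only.
    """
    namespace = {"image": image}
    try:
        exec(code, namespace)
        return True, namespace["result"]
    except Exception as err:  # includes KeyError when `result` is never set
        return False, f"{type(err).__name__}: {err}"


def solve_with_retries(attempts, image):
    """Try each snippet in order, collecting runtime feedback from failures.

    `attempts` stands in for successive model turns (hypothetical).
    """
    feedback_log = []
    for code in attempts:
        ok, payload = run_generated_code(code, image)
        if ok:
            return payload, feedback_log
        feedback_log.append(payload)
    return None, feedback_log


# A tiny 2x2 "image"; a real pipeline would use an imaging library instead.
image = [[1, 2], [3, 4]]

attempts = [
    "result = image.rotate(90)",  # fails: a list has no .rotate method
    "result = [list(r) for r in zip(*image[::-1])]",  # rotate 90 degrees clockwise
]

rotated, feedback_log = solve_with_retries(attempts, image)
print(rotated)
print(feedback_log)
```

Because the snippet is ordinary code rather than a call into a fixed registry, the same executor covers rotation, flipping, cropping, or a chain of several operations in one turn; only the generated code changes.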

Paper Structure

This paper contains 18 sections, 3 equations, 17 figures, and 3 tables.

Figures (17)

  • Figure 1: Diagnostic results on image orientation identification.
  • Figure 2: Three advantages of CodeVision observed during the training and inference stages.
  • Figure 3: Pipeline for cold-start SFT data construction.
  • Figure 4: Rollout and inference process, and token masking used during SFT/RL.
  • Figure 5: RL training curves for outcome, strategy, and total rewards. The consistent upward trend demonstrates that the agent effectively learns to use tools strategically to solve tasks.
  • ...and 12 more figures