Thyme: Think Beyond Images

Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, Guorui Zhou

TL;DR

Thyme presents a novel paradigm that enables multimodal LLMs to autonomously generate and execute code for rich image manipulations and computations, pushing beyond traditional think-with-images methods. The approach combines a two-stage training regime (SFT on a 500K-sample dataset, followed by reinforcement learning with GRPO-ATS) with a secure sandbox to balance reasoning with precise code execution. Empirical results across nearly 20 benchmarks show consistent gains in high-resolution perception and complex reasoning tasks, with ablations highlighting the value of masking, final-round training, and consistency-aware RL rewards. The work emphasizes practical efficiency, releasing the dataset, sandbox, and code to accelerate community adoption, while acknowledging limitations in base-model capacity and benchmark coverage. Thyme thus offers a scalable path to richer, tool-enabled multimodal reasoning in real-world tasks.

Abstract

Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored stimulating the use of visual information in the reasoning process to enhance model performance in perception and reasoning tasks. However, to the best of our knowledge, no open-source work currently offers a feature set as rich as proprietary models (O3), which can perform diverse image manipulations and simultaneously enhance logical reasoning capabilities through code. In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm for enabling MLLMs to transcend existing "think with images" approaches by autonomously generating and executing diverse image processing and computational operations via executable code. This approach not only facilitates a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows for mathematical computations, all while maintaining high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: an initial SFT stage on a curated dataset of 500K samples to teach code generation, followed by an RL phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to increase the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code execution precision. We conduct extensive experimental analysis and ablation studies. Comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly in challenging high-resolution perception and complex reasoning tasks.
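The idea of applying distinct temperatures to text and code generation can be illustrated with a minimal sketch. The abstract does not give the delimiters or the temperature values, so the `<code>` / `</code>` markers, the defaults `text_temp=1.0` and `code_temp=0.0`, and all function names below are assumptions made for illustration only; this is not the authors' implementation.

```python
import math
import random

# Hypothetical delimiters marking a code span in the generated sequence.
CODE_OPEN = "<code>"
CODE_CLOSE = "</code>"


def inside_code(tokens):
    """Return True if the most recent delimiter seen is an opening one."""
    in_code = False
    for tok in tokens:
        if tok == CODE_OPEN:
            in_code = True
        elif tok == CODE_CLOSE:
            in_code = False
    return in_code


def pick_temperature(tokens, text_temp=1.0, code_temp=0.0):
    """Choose the sampling temperature for the next token.

    Assumed values: a higher temperature for free-form reasoning text
    (exploration) and a lower one inside code (precision).
    """
    return code_temp if inside_code(tokens) else text_temp


def sample_token(logits, temperature, rng):
    """Sample one token from a {token: logit} dict at a given temperature."""
    if temperature <= 0.0:
        # Temperature 0 is treated as greedy decoding.
        return max(logits, key=logits.get)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    prefix = ["Crop", "the", "sign", CODE_OPEN, "img"]
    temp = pick_temperature(prefix)
    print(temp, sample_token({".crop": 2.0, ".rotate": 0.5}, temp, rng))
```

In this sketch the temperature is re-evaluated before every token, so a single rollout can mix exploratory reasoning text with near-deterministic code without any change to the policy itself.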

Paper Structure

This paper contains 40 sections, 7 equations, 16 figures, 9 tables.

Figures (16)

  • Figure 1: Benchmark performance of Thyme. The comprehensive set of image manipulation capabilities enables Thyme to achieve significant improvements over the baseline in perception tasks. By leveraging its ability to convert complex mathematical reasoning into executable code, it consistently outperforms baselines in mathematical reasoning benchmarks. The gains observed across a wide range of general benchmarks further validate the effectiveness of our training approach.
  • Figure 2: Overall pipeline of Thyme, illustrating the interaction between the model and the sandbox for iterative reasoning and code execution. Key processes such as reasoning, code generation, sandbox execution, and result feedback are highlighted.
  • Figure 3: SFT Data Construction Pipeline. First, samples are taken from an existing dataset and prompts are constructed based on the target functions (such as cropping, rotating, etc.). The model generates a thinking process and corresponding code based on the prompt. The code is then executed in a sandbox environment to filter out samples that fail to run properly. The remaining samples are reviewed by an additional MLLM to verify whether the code execution results align with the thinking process and effectively answer the question, eliminating invalid code samples. Finally, manual review is conducted to remove low-quality samples, ensuring the quality of the cold-start dataset.
  • Figure 4: Visualization of SFT Data instances. The left side presents a sample of data related to image processing operations, while the right side showcases a sample of data related to complex computations. During the training phase, the model autonomously generates code based on the analysis process, enabling the execution of desired image processing or computational tasks. This capability enhances the quality of perception and reasoning processes.
  • Figure 5: Visualization of RL Data Instances. Thyme RL data focuses on complex scenarios and high-resolution image interpretation. Human annotators identify challenging objects within the images, design corresponding questions, and provide answers, along with the appropriate bounding boxes.
  • ...and 11 more figures