CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu

TL;DR

CodePlot-CoT introduces a code-driven visual reasoning paradigm for mathematics: instead of generating images pixel by pixel, the model writes executable plotting code whose rendered output serves as a precise visual thought during problem solving. The model is trained on Math-VR, a large-scale bilingual dataset of 178K math problems that require visual reasoning. Its training data is built with MatplotCode, an image-to-code converter that maps mathematical figures to plotting code, so the resulting reasoning chains interleave natural language with code-driven visuals in a controllable way. On the Math-VR benchmark the model achieves up to a 21% increase over its base model and compares favorably with both text-only and direct image-generation methods. The authors release the dataset, code, and pretrained models publicly. The work targets problems that need visual aids such as auxiliary lines and function plots, with practical implications for math problem solving and education.

Abstract

Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as drawing auxiliary lines or plotting functions. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the necessary precision and controllability for such tasks. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. Our approach leverages a VLM to generate text reasoning as well as executable plotting code, which is then rendered into images as "visual thoughts", to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized for parsing complex mathematical figures into code. Finally, using this training data, we train the CodePlot-CoT model for solving mathematical problems. Experimental results show that our model achieves up to a 21% increase over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, a comprehensive benchmark, and a strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
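To make the idea of interleaving text reasoning with executable plotting code concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical convention in which the model wraps plotting code in `<code>...</code>` tags; the delimiter, the function names `split_interleaved` and `render_visual_thought`, and the example response are all illustrative assumptions. Rendering uses Matplotlib, which is imported lazily so that the parsing step runs without it.

```python
import re
from typing import List, Tuple

# Assumption: plotting code is delimited by <code>...</code>.
# The paper's actual output format is not described in this summary.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def split_interleaved(response: str) -> List[Tuple[str, str]]:
    """Split a response into ordered ("text", ...) and ("code", ...) segments."""
    segments: List[Tuple[str, str]] = []
    cursor = 0
    for match in CODE_PATTERN.finditer(response):
        text = response[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("code", match.group(1).strip()))
        cursor = match.end()
    tail = response[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def render_visual_thought(code: str, out_path: str) -> str:
    """Execute plotting code and save the current figure to out_path.

    Model-generated code is untrusted; a real system should sandbox it.
    """
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt

    exec(code, {"plt": plt})
    plt.savefig(out_path)
    plt.close("all")
    return out_path


# Hypothetical model output: reasoning, then plotting code, then a conclusion.
EXAMPLE_RESPONSE = """Draw the altitude from C to AB as an auxiliary line.
<code>
plt.plot([0, 4, 1, 0], [0, 0, 3, 0], "k-")  # triangle A(0,0), B(4,0), C(1,3)
plt.plot([1, 1], [3, 0], "r--")             # altitude from C to AB
plt.axis("equal")
</code>
The altitude has length 3, so the area is 0.5 * 4 * 3 = 6."""

segments = split_interleaved(EXAMPLE_RESPONSE)
```

Calling `render_visual_thought(segments[1][1], "thought.png")` would turn the code segment into an image that could be fed back to the model as the next reasoning input. Because the figure comes from exact coordinates rather than generated pixels, its geometry is reproducible, which is the precision argument the abstract makes.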

Paper Structure

This paper contains 34 sections, 1 equation, 15 figures, 5 tables.

Figures (15)

  • Figure 1: A comparison of mathematical reasoning benchmarks and the methods on the visual reasoning problem. (1) illustrates that unlike existing benchmarks that rely on textual reasoning, Math-VR requires deep visual reasoning to resolve the math problems. (2) shows that on a visually ambiguous problem from Math-VR, both text-only and unified multimodal models fail. Our method, CodePlot-CoT, succeeds by programmatically generating the figure to uncover its true geometric properties, thus arriving at the correct solution.
  • Figure 1: Key Statistics for Math-VR Benchmark. We report statistics of our benchmark, including token lengths of questions and solutions, as well as the number and resolution of images.
  • Figure 2: Visualization of Math-VR sample.
  • Figure 3: Distribution of Knowledge Types in Math-VR Benchmark. Geometry constitutes the majority of problems (77%), with Algebra and Calculus comprising 13%.
  • Figure 4: Math-VR Evaluation Pipeline. We design a VLM-based framework to comprehensively assess visual reasoning abilities of different models. The evaluation uses two metrics: Answer Correctness (AC), which gives a reliable binary judgment of the final answer, and Process Score (PS), which provides a fine-grained assessment of the solving process.
  • ...and 10 more figures
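
The two metrics named in the Figure 4 caption can be aggregated over a benchmark roughly as follows. This is a sketch under stated assumptions: Answer Correctness (AC) is treated as a per-sample boolean, as the caption describes, while the Process Score (PS) is assumed to be a per-sample value in [0, 1], since the actual scoring scale is not given here. The names `SampleResult` and `aggregate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class SampleResult:
    answer_correct: bool  # AC: binary judgment of the final answer
    process_score: float  # PS: assumed to lie in [0, 1] (scale not stated in the source)


def aggregate(results: Iterable[SampleResult]) -> Dict[str, float]:
    """Return benchmark-level AC and PS as percentages (simple means)."""
    items = list(results)
    if not items:
        raise ValueError("no results to aggregate")
    n = len(items)
    ac = 100.0 * sum(r.answer_correct for r in items) / n
    ps = 100.0 * sum(r.process_score for r in items) / n
    return {"AC": ac, "PS": ps}


# Illustrative values only, not results from the paper.
demo = aggregate([
    SampleResult(True, 1.0),
    SampleResult(False, 0.5),
    SampleResult(False, 0.0),
    SampleResult(True, 0.5),
])
```

Reporting the two numbers separately keeps partial credit for a sound solving process visible even when the final answer is wrong.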