MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li

TL;DR

Multi-modal chain-of-thought (CoT) for mathematics is hampered by coarse visual cues and weak math-specific perception. MINT-CoT introduces an Interleave Token that adaptively grounds and interleaves fine-grained visual tokens into each CoT step, backed by a 54K visual interleaved CoT dataset and a three-stage training strategy that includes supervised and reinforcement learning. Empirical results on MathVista, GeoQA, and MMStar show substantial gains over baselines and competitive performance with state-of-the-art open-source reasoning models, demonstrating effective token-level visual grounding for mathematical reasoning. The work also provides an automated data-generation pipeline and a scalable framework for grounding visual evidence in visual-text reasoning tasks.

Abstract

Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either apply similar text-only reasoning to image input or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of math content by vision encoders, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To enable this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems that align each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which yields our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT

Paper Structure

This paper contains 35 sections, 12 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Comparison of three CoT reasoning methods: text-only CoT reasoning, box-shaped visual CoT reasoning, and our visual interleaved CoT reasoning. (1) Text-only CoT lacks visual information, causing perception errors in mathematical reasoning. (2) Box-level cues are too coarse to capture complex visual structures in mathematical images. (3) Token-level interleaved CoT accurately identifies fine-grained visual regions to support reasoning.
  • Figure 2: Overview of the MINT-CoT framework. During CoT reasoning, MINT-CoT generates an Interleave Token before each reasoning step and computes the similarity scores between embeddings projected by the decoder-side visual projector and the interleave projector. Based on these similarity scores, relevant visual tokens are selected, and the model continues inference with the selected visual tokens.
  • Figure 3: Data generation pipeline. Step 1: Grid Images. We divide each image into grid cells and assign an index value to each cell. Step 2: Apply OCR. We use PaddleOCR to recognize textual elements and associate them with the corresponding grid indices. Step 3: Extract Key Words. We employ GPT-4o to extract key words from each reasoning step. Step 4: Align and Annotate Key Words. We use GPT-4o to annotate each key word with grid indices, yielding the final visual interleaved CoT reasoning steps.
  • Figure 4: F1 score plot of visual token selection during Interleaved CoT SFT.
  • Figure 5: Qualitative results of Qwen2-VL-7B-Instruct and MINT-CoT-7B. MINT-CoT-7B demonstrates improved CoT reasoning capability by interleaving fine-grained visual tokens. Also shown is a visualization of the similarity scores for the Interleave Token generated during Step 4.
  • ...and 6 more figures
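
The captions for Figures 2 and 4 describe selecting visual tokens from similarity scores and tracking an F1 score for that selection. The sketch below is only an illustration of that idea, not the authors' implementation: the use of cosine similarity, the fixed threshold of 0.7, the toy 2-dimensional embeddings, and the function names select_visual_tokens and f1_score are all assumptions introduced here. It scores each visual token against a single Interleave Token embedding, keeps the tokens above the threshold, and compares the selection with a set of annotated grid indices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_visual_tokens(interleave_vec, visual_vecs, threshold=0.7):
    """Return (selected indices, all scores).

    Assumption: a token is selected when its cosine similarity with the
    Interleave Token embedding exceeds a fixed threshold. The source only
    says that selection is based on similarity scores.
    """
    scores = [cosine(interleave_vec, v) for v in visual_vecs]
    selected = [i for i, s in enumerate(scores) if s > threshold]
    return selected, scores


def f1_score(selected, target):
    """F1 between selected token indices and annotated (ground-truth) indices."""
    selected, target = set(selected), set(target)
    true_positives = len(selected & target)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(target)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example: a 2x2 grid flattened to four visual tokens (hypothetical values).
    visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    interleave = [1.0, 0.0]

    selected, scores = select_visual_tokens(interleave, visual)
    print(selected)                    # [0, 1]
    print(f1_score(selected, [0, 2]))  # 0.5
```

Because selection is per token rather than per bounding box, the chosen indices need not form a rectangle, which is the property the Figure 1 caption contrasts with box-level cues.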