ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

Zhengzhuo Xu, SiNan Du, Yiyan Qi, Siwen Lu, Chengjin Xu, Chun Yuan, Jian Guo

TL;DR

This work introduces PointCoT, a grounding-reflection mechanism that ties chain-of-thought reasoning to chart content by generating bounding boxes and re-rendered visualizations. It creates ChartPoint-SFT-62k through an automated pipeline that labels grounding steps, edits plotting code, renders charts, and extracts precise positions, enabling bounding-box supervision for reasoning. Two instruction-tuned models, ChartPoint_Q2 and ChartPoint_Q2.5, are trained in a two-stage process and demonstrate state-of-the-art performance on ChartBench and ChartQA benchmarks, especially in settings with sparse text. The approach highlights the importance of visual grounding for reliable chart understanding and provides a scalable data-generation pipeline and training recipe for grounding-aware multimodal reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have emerged as powerful tools for chart comprehension. However, they rely heavily on content extracted via OCR, which leads to numerical hallucinations when a chart's textual annotations are sparse. While existing methods focus on scaling instructions, they fail to address the fundamental challenge, i.e., reasoning with visual perception. In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize key positions to match their reasoning. To bridge this gap, we propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning in charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions. We further introduce an automated pipeline to construct ChartPoint-SFT-62k, a dataset featuring 19.2K high-quality chart samples with step-by-step CoT, bounding boxes, and re-rendered visualizations. Leveraging this data, we develop two instruction-tuned models, ChartPoint_Q2 and ChartPoint_Q2.5, which outperform state-of-the-art methods across several chart benchmarks, e.g., +5.04% on ChartBench.
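To make the re-rendering step concrete, the following is a minimal sketch of overlaying a predicted bounding box on a chart image so that the marked region can be inspected again. It is not the authors' implementation: the pixel-grid image representation, the `(x_min, y_min, x_max, y_max)` box convention, and the function names `clamp_bbox` and `overlay_bbox` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed representation: a grayscale image as a row-major grid of ints.
Image = List[List[int]]
# Assumed convention: (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
BBox = Tuple[int, int, int, int]


def clamp_bbox(bbox: BBox, width: int, height: int) -> BBox:
    """Order the corners and clip the box to the image bounds."""
    x0, y0, x1, y1 = bbox
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))
    x0 = max(0, min(x0, width - 1))
    x1 = max(0, min(x1, width - 1))
    y0 = max(0, min(y0, height - 1))
    y1 = max(0, min(y1, height - 1))
    return x0, y0, x1, y1


def overlay_bbox(image: Image, bbox: BBox, value: int = 255) -> Image:
    """Return a copy of `image` with the outline of `bbox` drawn in `value`."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = clamp_bbox(bbox, width, height)
    out = [row[:] for row in image]  # leave the input image untouched
    for x in range(x0, x1 + 1):      # top and bottom edges
        out[y0][x] = value
        out[y1][x] = value
    for y in range(y0, y1 + 1):      # left and right edges
        out[y][x0] = value
        out[y][x1] = value
    return out


# Hypothetical usage: an 8x6 blank "chart" and a predicted box.
chart = [[0] * 8 for _ in range(6)]
marked = overlay_bbox(chart, (2, 1, 5, 4))
```

In a grounding-reflection loop of the kind described above, the marked image would be passed back to the model alongside the reasoning step it is meant to support; how that feedback is formatted is outside the scope of this sketch.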

Paper Structure

This paper contains 24 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison between vanilla CoT and the proposed CoT with bounding box reflection on Qwen2-VL. Vanilla CoT fails to introduce visual-level reflections. We re-render the generated BBox on the query chart to verify area focus and successfully improve the precision of the extracted numbers.
  • Figure 2: Chain of thought step generation based on plot code.
  • Figure 3: The pipeline of code editing with grounding steps.
  • Figure 4: The pipeline for constructing instruction data. Red / green indicates the instruction prompt / ground truth.
  • Figure 5: Statistics of ChartPoint-SFT-62k. Left: Number of CoT steps w.r.t. grounding, reasoning, and total steps. Right: Chart type distribution.
  • ...and 3 more figures