ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension

Duo Xu, Hao Cheng, Xin Lin, Zhen Xie, Hao Wang

TL;DR

ChartM$^3$ introduces a multi-stage, code-driven pipeline to synthesize high-quality visual reasoning data for chart comprehension, combining retrieval-augmented template selection, executable data/code generation, and verifiable reasoning paths. The dataset encompasses 38.4K charts and 142K Q&A pairs with 2,871 evaluation samples, enabling supervised fine-tuning and reinforcement learning with verifiable rewards. Experimental results show that CoT-based fine-tuning and GRPO-style RL significantly enhance reasoning and cross-domain generalization, allowing smaller models to approach larger ones on complex chart tasks. The work bridges a gap between benchmark-focused evaluation and real-world chart understanding, offering a scalable resource and methodology for advancing reasoning-enabled multimodal models.

Abstract

Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning code that simulates real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples to enable practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
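
The code-driven idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and the exact-match reward rule are all assumptions made for illustration, and the chart rendering step is omitted. The sketch shows structured data being synthesized by code, a multi-step answer being computed from that data rather than read off an image, and a reward that can be checked automatically.

```python
import random
import statistics


def generate_chart_data(seed: int = 0) -> dict:
    """Synthesize a small table for a hypothetical bar chart (illustrative only)."""
    rng = random.Random(seed)  # seeded so the same chart can be regenerated
    categories = ["Q1", "Q2", "Q3", "Q4"]
    values = [round(rng.uniform(50, 150), 1) for _ in categories]
    return {"title": "Quarterly revenue (hypothetical)", "x": categories, "y": values}


def answer_by_code(data: dict) -> dict:
    """Derive a multi-step answer by executing analysis code on the data."""
    mean = statistics.mean(data["y"])  # step 1: compute the average
    above = [c for c, v in zip(data["x"], data["y"]) if v > mean]  # step 2: filter
    return {
        "question": "How many quarters are above the average value?",
        "steps": [f"mean = {mean:.2f}", f"above mean: {above}"],
        "answer": len(above),  # step 3: count
    }


def verifiable_reward(predicted: str, reference: int) -> float:
    """Assumed reward rule: 1.0 on an exact integer match, otherwise 0.0."""
    try:
        return 1.0 if int(predicted.strip()) == reference else 0.0
    except ValueError:
        return 0.0  # unparsable model output earns no reward


if __name__ == "__main__":
    data = generate_chart_data(seed=0)
    qa = answer_by_code(data)
    print(qa["question"], qa["steps"], qa["answer"])
    print(verifiable_reward(str(qa["answer"]), qa["answer"]))
```

Because the reference answer comes from executed code, it can be regenerated and checked for any seed, which is what makes a reward of this kind verifiable.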

Paper Structure

This paper contains 21 sections, 2 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: Left: ChartM$^3$ covers 9 major categories of chart types, totaling 62 subcategories. Right: Performance comparison of representative MLLMs across ChartM$^3$ task categories.
  • Figure 2: The ChartM$^3$ data generation pipeline follows a progressive automated workflow that begins by generating key questions and using RAG to select appropriate templates from a diverse chart database. The process then advances through multiple code-driven stages: creating structured data, producing rendering code, and generating Q&A pairs with multi-step visual reasoning by synthesizing analytical code. Finally, the pipeline conducts comprehensive model-based assessments of data quality and difficulty levels.
  • Figure 3: Performance comparison between models trained by SFT with and without CoT Q&A across different evaluation metrics.
  • Figure 4: Performance of models trained by GRPO with different numbers of samples across multiple datasets.
  • Figure 5: The distribution of ChartM$^3$ Q&A categories.
  • ...and 9 more figures