Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner

Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Yufeng Zhong, Lin Ma

TL;DR

Chart-R1 addresses the challenge of complex reasoning on charts by combining a novel programmatic data synthesis pipeline with a two-stage training regime (Chart-COT and Chart-RFT) to create and leverage reasoning data for chart-domain VLMs. It introduces ChartRQA, a large-scale, code-grounded training corpus (258k samples) plus a human-verified benchmark for multi-chart reasoning, derived from real-world arXiv tables. The core RL method, GRPO-based Chart-RFT, uses a soft accuracy reward and edit-distance format checks to optimize both numerical and textual outputs. Empirically, Chart-R1 achieves state-of-the-art performance among small (<20B) VLMs on chart reasoning benchmarks and is competitive with large-scale open- and closed-source models.
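To make the reward idea concrete, the following is a minimal sketch of a soft accuracy reward that handles numerical and textual answers differently. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 5% relative tolerance (`rel_tol`), the binary numeric score, and the use of normalized edit-distance similarity for text are all hypothetical choices.

```python
def parse_number(text):
    """Return text as a float, or None if it is not numeric."""
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def soft_accuracy_reward(prediction, target, rel_tol=0.05):
    """Score a predicted answer against the target, in [0, 1].

    Assumption: numeric answers earn 1.0 when within rel_tol relative
    error of the target; text answers earn a normalized edit-distance
    similarity. Neither rule is taken from the paper.
    """
    pred_num, target_num = parse_number(prediction), parse_number(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return 1.0 if pred_num == 0 else 0.0
        rel_err = abs(pred_num - target_num) / abs(target_num)
        return 1.0 if rel_err <= rel_tol else 0.0
    a, b = prediction.strip().lower(), target.strip().lower()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

With these assumed settings, a prediction of `"102"` against a target of `"100"` is accepted, while `"120"` is not; a relaxed numeric match of this kind tolerates small misreadings of chart values that exact string matching would penalize.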

Abstract

Recently, inspired by OpenAI-o1/o3 and DeepSeek-R1, R1-style methods based on reinforcement learning fine-tuning have received widespread attention from the community. Previous R1-style methods mainly focus on mathematical reasoning and code intelligence, so it is of great research significance to verify their advantages on more general multimodal data. Charts are an important multimodal data type with rich information, and they pose significant challenges for complex reasoning. In this work, we introduce Chart-R1, a chart-domain vision-language model with reinforcement learning fine-tuning to enable complex chart reasoning. To support Chart-R1, we first propose a novel programmatic data synthesis technique to generate high-quality step-by-step chart reasoning data covering single- and multi-subchart settings, which compensates for the lack of reasoning data in the chart domain. We then develop a two-stage training strategy: Chart-COT with step-by-step chain-of-thought supervision, and Chart-RFT with numerically sensitive reinforcement fine-tuning. Chart-COT aims to decompose complex chart reasoning tasks into fine-grained, understandable subtasks through step-by-step supervision, which lays a good foundation for improving reasoning ability during reinforcement learning. Chart-RFT utilizes the typical group relative policy optimization (GRPO) strategy, in which a relatively soft reward is adopted for numerical responses to emphasize numerical sensitivity in the chart domain. We conduct extensive experiments on open-source benchmarks and our self-built chart reasoning dataset (i.e., ChartRQA). Experimental results show that Chart-R1 has significant advantages over chart-domain methods and is even comparable to large-scale open- and closed-source models (e.g., GPT-4o, Claude-3.5).
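As a rough illustration of the group relative policy optimization step mentioned above, the sketch below computes group-relative advantages: several responses are sampled for the same question, each is scored by a reward function, and each reward is normalized against the group's mean and standard deviation. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions for illustration; implementations differ on these details, and the abstract does not specify them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-response rewards within one sampled group.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation and a small eps to
    avoid division by zero when all rewards are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one chart question.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that score above the group average receive positive advantages and are reinforced, while those below receive negative advantages. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.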

Paper Structure

This paper contains 19 sections, 3 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Comparison of existing VLM performance on chart understanding and reasoning benchmarks. In the complex chart reasoning task, Qwen2.5-VL-7B generates a wrong thinking process, whereas Chart-R1 thinks and answers correctly.
  • Figure 2: Examples from our proposed ChartRQA dataset. ChartRQA includes single- and multi-chart images with complex questions that require step-by-step reasoning to answer.
  • Figure 3: Training curve for the RL stage using only the ChartQA dataset.
  • Figure 4: Training curves for the RL stage using the ChartQA and ChartRQA datasets. The orange curve represents our proposed two-stage training strategy, while the blue curve corresponds to a baseline RL-only setting.
  • Figure 5: Visualization of a case study in which Chart-R1 (with and without Chart-COT) answers correctly but Qwen2.5-VL-7B fails.
  • ...and 7 more figures