ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding

Muye Huang, Lingling Zhang, Jie Ma, Han Lai, Fangzhi Xu, Yifei Li, Wenjun Wu, Yaqiang Wu, Jun Liu

TL;DR

ChartSketcher introduces a multimodal, feedback-driven framework for chart understanding that combines a programmatic sketching library with Sketch-CoT reasoning and a reflection loop. It is trained in two stages: a cold start via cross-modal distillation, followed by an off-policy RL phase using Sketch-MCTS, to ground visual reasoning and improve generalization. Empirical results across 18 datasets show improvements on chart-specific benchmarks while maintaining strong performance on general vision tasks, with enhanced interpretability through on-image sketches. The approach underscores the value of explicit visual grounding and self-reflection for accurate data extraction from complex charts.

Abstract

Charts present complex data in a dense visual form and serve as a crucial medium for information extraction and analysis. Automated chart understanding poses significant challenges to existing multimodal large language models (MLLMs) because it requires precise and complex visual reasoning. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. However, they struggle to refine or correct their reasoning when errors stem from flawed visual understanding, as they lack the ability to leverage multimodal interaction for deeper comprehension. Inspired by human cognitive behavior, we propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations. ChartSketcher is a chart understanding model that employs Sketch-CoT, which enables MLLMs to annotate intermediate reasoning steps directly onto charts using a programmatic sketching library and to iteratively feed these visual annotations back into the reasoning process. This mechanism enables the model to visually ground its reasoning and refine its understanding over multiple steps. We employ a two-stage training strategy: a cold start phase to learn sketch-based reasoning patterns, followed by off-policy reinforcement learning to enhance reflection and generalization. Experiments demonstrate that ChartSketcher achieves promising performance on chart understanding benchmarks and general vision tasks, providing an interactive and interpretable approach to chart comprehension.
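
The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or the API of their sketching library: the `Canvas` class, the one-command-per-line drawing format, the `sketch_cot` function, and the scripted stand-in model are all hypothetical, chosen only to show the control flow. The one detail taken from the source is that a step containing `<code>` triggers a drawing call, and reasoning ends when a step contains no `<code>`.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for the chart image plus its sketch overlay.
@dataclass
class Canvas:
    annotations: List[Tuple[str, Tuple[float, ...]]] = field(default_factory=list)

    def draw(self, shape: str, *coords: float) -> None:
        # Record the annotation; a real library would render it onto the image.
        self.annotations.append((shape, coords))

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def run_sketch(canvas: Canvas, code: str) -> None:
    """Apply a toy drawing format (an assumption): one 'shape x1 y1 ...' per line."""
    for line in code.strip().splitlines():
        parts = line.split()
        if not parts:
            continue
        shape, *nums = parts
        canvas.draw(shape, *(float(n) for n in nums))

Model = Callable[[str, Canvas, List[str]], str]

def sketch_cot(model: Model, question: str, max_steps: int = 8):
    """Alternate reasoning steps and sketches until a step has no <code>."""
    canvas = Canvas()
    history: List[str] = []
    for _ in range(max_steps):  # guard against a model that never stops drawing
        step = model(question, canvas, history)
        history.append(step)
        match = CODE_RE.search(step)
        if match is None:
            break  # no drawing call: the reasoning process has ended
        run_sketch(canvas, match.group(1))  # the updated canvas feeds the next step
    return history, canvas

# Scripted stand-in for an MLLM, used only to exercise the loop.
def scripted_model(question: str, canvas: Canvas, history: List[str]) -> str:
    script = [
        "Mark the top of the 2020 bar. <code>line 0 42 100 42</code>",
        "The line meets the y-axis at 42, so the answer is 42.",
    ]
    return script[len(history)]

history, canvas = sketch_cot(scripted_model, "What is the 2020 value?")
```

In this toy run the first step draws a horizontal line and the second step, which sees the annotated canvas, answers without emitting `<code>`, so the loop stops. The reflection behavior described in the paper, where earlier sketches are corrected or erased, is omitted here.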

Paper Structure

This paper contains 29 sections, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: The overview of the proposed ChartSketcher. Dashed lines indicate intermediate reasoning and reflection processes, with corresponding sketch outputs shown for each step.
  • Figure 2: Overview of ChartSketcher Training Process. The upper part illustrates the cold start phase, focusing on knowledge distillation and pattern learning. The lower part shows the offline RL optimization process, which is conducted on diverse datasets. In the figure, <code> indicates that ChartSketcher is calling the Programmatic Sketching Library to draw. When ChartSketcher no longer outputs <code>, it indicates that the reasoning process has ended.
  • Figure 3: Analysis of CoT length and the number of rethink iterations for both correctly and incorrectly answered questions across all listed datasets. Datasets listed on the left (blue font) are chart-specific benchmarks, while those on the right (brown font) represent general image datasets.
  • Figure 4: Four cases for ChartSketcher. The drawing code associated with each step is omitted for clarity. Arrows indicate the visual outputs generated by specific reasoning steps. Semi-transparent elements represent outputs that were subsequently corrected or erased by later steps.
  • Figure 5: An illustrative example of a challenge in current MLLMs.
  • ...and 4 more figures