ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

TL;DR

ChartAgent introduces a visually grounded, tool-augmented multimodal agent for chart understanding that iteratively decomposes chart questions into visual subtasks and manipulates the chart image via chart-specific tools. The ReAct-like loop, together with a modular tool library and visual self-verification, enables precise grounding beyond OCR and textual shortcuts, yielding state-of-the-art results on ChartBench and ChartX, including strong gains on unannotated and numerically intensive queries. The framework demonstrates plug-and-play compatibility with multiple base LLMs and offers extensive ablations, recovery analyses, and a detailed failure-mode study to guide future improvements. Together, these contributions advance reliable, interpretable chart reasoning through tool-driven visual analysis and have practical implications for robust chart QA across diverse chart types.

Abstract

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, which require precise visual interpretation rather than reliance on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
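To make the idea of reading values from an unannotated chart concrete, the following is a minimal sketch of one visual subtask: once the axis has been localized and a bar isolated, a pixel position can be mapped to a data value by linear interpolation between two axis ticks. This is an illustration only, not the tool implementation from the paper; the function name `pixel_to_value`, the tick representation, and the assumption of a linear axis are all hypothetical.

```python
from typing import Tuple

# Hypothetical representation: (pixel_row, data_value) for one axis tick.
Tick = Tuple[float, float]


def pixel_to_value(y_pixel: float, tick_a: Tick, tick_b: Tick) -> float:
    """Map a pixel row to a data value, assuming a linear y-axis.

    tick_a and tick_b are two localized axis ticks, each given as
    (pixel_row, data_value). Image rows grow downward, so a larger
    pixel row usually corresponds to a smaller data value.
    """
    (row_a, value_a), (row_b, value_b) = tick_a, tick_b
    if row_a == row_b:
        raise ValueError("ticks must lie on distinct pixel rows")
    # Linear interpolation (or extrapolation) along the axis.
    slope = (value_b - value_a) / (row_b - row_a)
    return value_a + (y_pixel - row_a) * slope


# Usage: the tick for 0 sits at row 400 and the tick for 60 at row 100;
# the top edge of an isolated bar was detected at row 250.
bar_value = pixel_to_value(250, (400, 0.0), (100, 60.0))
```

In this example the bar top lies halfway between the two ticks, so the estimate falls halfway between their values. A logarithmic axis would need the interpolation done in log space, which this sketch does not handle.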

Paper Structure

This paper contains 79 sections, 13 figures, 15 tables.

Figures (13)

  • Figure 1: Comparison of our work with the existing SoTA. (a) ChartAgent performs visually grounded reasoning in the chart domain. For this unannotated chart, GPT-4o fails to produce the correct answer, whereas ChartAgent succeeds. (b) ChartAgent performance on unannotated charts and numeric QA compared with the top-10 SoTA.
  • Figure 2: ChartAgent. The (A) orchestrator extracts chart metadata and routes annotated charts with textual shortcuts and qualitative QA to the base MLLM, while unannotated charts and numeric queries trigger the ReAct-style loop. The system includes (B) a library of universal and chart-specific tools, (C) metadata for parameterizing tool usage and retrieving chart-type-specific ICL examples, and (D) few-shot ICL retrieval. Using these components as the (E) input, ChartAgent performs (F) iterative visual reasoning, supported by (G) visual self-verification of intermediate outputs. When tool-based reasoning is unreliable, (H) the agent falls back to the base MLLM.
  • Figure 3: (a) Left: ChartAgent vs. concurrent works: overall accuracy ($\uparrow$) and average absolute error ($\downarrow$). (b) Right: Effectiveness of visual self-verification: enabled 70% successful recoveries when invoked.
  • Figure 4: Analysis of ChartAgent Performance. (a) Left: Stratified by visual complexity of charts and reasoning complexity of chart–QA pairs on unannotated charts, compared with top-10 SoTA. (b) Middle: ChartAgent performance on unannotated+numeric chartQA when instantiated with different base MLLMs. (c) Right: Ablation study comparing ChartAgent with ReAct using no tools and ReAct with natural image–based generic tools.
  • Figure 5: Examples of annotated (top) vs. unannotated (bottom) charts. An annotated chart contains explicit textual annotations or shortcuts, whereas an unannotated chart lacks such explicit value indicators. For instance, in the first column (top), the bar chart includes printed bar values, while in the corresponding bottom chart, the values must be inferred through visual interpretation.
  • ...and 8 more figures
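
The routing described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ChartMetadata`, `route`, and `answer` are hypothetical, and the rule that the agent loop runs only when the chart is unannotated and the query is numeric is one reading of the caption. The fallback to the base MLLM is modelled by the agent loop returning `None` when tool-based reasoning is judged unreliable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChartMetadata:
    """Hypothetical metadata record produced by the orchestrator."""
    chart_type: str
    is_annotated: bool


def route(meta: ChartMetadata, is_numeric_query: bool) -> str:
    """Choose a path: annotated charts and qualitative QA go to the base MLLM."""
    if meta.is_annotated or not is_numeric_query:
        return "base_mllm"
    return "agent_loop"


def answer(
    meta: ChartMetadata,
    is_numeric_query: bool,
    base_mllm: Callable[[], str],
    agent_loop: Callable[[], Optional[str]],
) -> str:
    """Run the chosen path, falling back to the base MLLM if the loop gives up."""
    if route(meta, is_numeric_query) == "agent_loop":
        result = agent_loop()  # assumed to return None when unreliable
        if result is not None:
            return result
    return base_mllm()


# Usage with stand-in callables in place of real model and tool calls.
meta = ChartMetadata(chart_type="bar", is_annotated=False)
final = answer(meta, True, base_mllm=lambda: "base answer", agent_loop=lambda: None)
```

Here the stand-in agent loop reports failure, so the final answer comes from the base MLLM path.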