Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

Yuyang Ji, Haohan Wang

TL;DR

This work exposes the fragility of current multimodal LLMs in true visual chart reasoning, showing that removing textual labels can cause performance drops of roughly 30% on ChartQA because models lean on OCR shortcuts. It introduces Socratic Chart, a framework that converts chart images into richly structured Scalable Vector Graphics (SVG) using a multi-agent pipeline with specialized agent-generators and an agent-critic to produce high-fidelity symbolic representations. By fusing these SVGs with visual inputs, the approach achieves state-of-the-art or competitive performance on the ChartQA and CharXiv benchmarks, including strong robustness under label removal and perturbations. The method offers a principled, interpretable pathway toward robust chart understanding in MLLMs, with potential broad impact on scientific visualization and data interpretation tasks.

Abstract

Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.
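
The generator and critic interplay described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Bar` record, `stub_generator`, `stub_critic`, `extract_with_critic`, and `to_svg` names are all hypothetical, and the two agents are plain stub functions standing in for MLLM-backed components.

```python
from dataclasses import dataclass
from typing import Callable, List
from xml.sax.saxutils import quoteattr


@dataclass
class Bar:
    """One extracted bar primitive in pixel coordinates (hypothetical schema)."""
    label: str
    x: float
    y: float
    width: float
    height: float


def stub_generator(feedback: List[str]) -> List[Bar]:
    # Stand-in for an agent-generator. The first attempt (no feedback yet)
    # contains a deliberate error so the critic has something to flag.
    height_b = 80.0 if feedback else -80.0
    return [Bar("A", 10, 60, 20, 40.0), Bar("B", 40, 20, 20, height_b)]


def stub_critic(bars: List[Bar]) -> List[str]:
    # Stand-in for an agent-critic: returns a list of detected problems.
    return [f"bar {b.label}: non-positive height" for b in bars if b.height <= 0]


def extract_with_critic(
    generate: Callable[[List[str]], List[Bar]],
    critique: Callable[[List[Bar]], List[str]],
    max_rounds: int = 3,
) -> List[Bar]:
    """Re-run the generator with critic feedback until no issues remain."""
    feedback: List[str] = []
    bars: List[Bar] = []
    for _ in range(max_rounds):
        bars = generate(feedback)
        feedback = critique(bars)
        if not feedback:
            break
    return bars


def to_svg(bars: List[Bar], width: int = 100, height: int = 100) -> str:
    """Serialize validated primitives as a small SVG document."""
    rects = "\n".join(
        f'  <rect x="{b.x}" y="{b.y}" width="{b.width}" '
        f'height="{b.height}" data-label={quoteattr(b.label)}/>'
        for b in bars
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">\n{rects}\n</svg>'
    )


if __name__ == "__main__":
    print(to_svg(extract_with_critic(stub_generator, stub_critic)))
```

In this sketch the critic only checks for non-positive heights; a real critic would presumably compare the extracted primitives against the chart image, which is beyond what a stub can show.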

Paper Structure

This paper contains 26 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Performance comparison of previous systems on the ChartQA benchmark under two conditions: with original charts (top) and with charts where text labels have been removed (bottom). Removing textual labels eliminates shortcuts that allow models to rely on OCR-based extraction rather than genuine visual reasoning. The significant performance drops observed in systems such as GPT-4v, Gemini2, and SIMPLOT highlight their dependence on these text-based shortcuts when answering questions. In contrast, our proposed method, Socratic Chart, shows a substantially smaller performance drop (23.9%) because its framework transforms chart images into Scalable Vector Graphics (SVG) representations. This approach enables multimodal large language models (MLLMs) to integrate visual information with textual context more effectively, resulting in enhanced chart understanding even when explicit textual cues are absent.
  • Figure 2: Overview of the multi-agent collaboration pipeline for transforming chart images into Scalable Vector Graphics (SVG) representations. The process begins with chart type classification, followed by specialized agent-generators extracting semantic and geometric attributes (e.g., bar dimensions, text labels, legends). Agent-critics refine these outputs by identifying and correcting errors. The final merged SVG encodes all chart elements, enabling multimodal large language models (MLLMs) to perform robust zero-shot reasoning tasks like trend analysis and data interpretation.
  • Figure 3: We present a visualization of the SVG code generated using our multi-agent pipeline, demonstrating its ability to assist MLLMs in answering questions that GPT-4v fails to address correctly. The SVG representation encodes the geometric and semantic attributes of the chart, such as bar heights, line coordinates, and text labels, in a structured format. This high-fidelity representation enables MLLMs to perform precise reasoning tasks, such as extracting specific data points or identifying trends, even in complex charts where traditional models like GPT-4v may struggle due to their reliance on visual estimation. By leveraging the symbolic abstraction provided by the SVG format, our framework ensures accurate and interpretable chart understanding.
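
To make concrete why a symbolic SVG supports precise data extraction rather than visual estimation, here is a small Python sketch that reads bar values straight from SVG geometry. It is an assumption-laden illustration, not the paper's code: the example SVG, the `data-label` attribute, the `bar_values` helper, and the `pixels_per_unit` axis calibration are all hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical SVG for a three-bar chart; each <rect> carries its category
# label in a data-label attribute (an assumed convention for this sketch).
EXAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
  <rect x="20" y="60" width="30" height="40" data-label="2019"/>
  <rect x="70" y="20" width="30" height="80" data-label="2020"/>
  <rect x="120" y="40" width="30" height="60" data-label="2021"/>
</svg>"""


def bar_values(svg_text: str, pixels_per_unit: float) -> dict:
    """Map each bar label to its data value, given an assumed linear y-axis scale."""
    root = ET.fromstring(svg_text)
    values = {}
    for rect in root.iter(f"{SVG_NS}rect"):
        height_px = float(rect.get("height"))
        values[rect.get("data-label")] = height_px / pixels_per_unit
    return values


if __name__ == "__main__":
    values = bar_values(EXAMPLE_SVG, pixels_per_unit=2.0)
    tallest = max(values, key=values.get)
    print(values)   # {'2019': 20.0, '2020': 40.0, '2021': 30.0}
    print(tallest)  # 2020
```

Because the bar heights are stored as exact numbers, a question such as "which year has the highest value?" reduces to a lookup and comparison, with no pixel-level estimation from the rendered image.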