Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding
Yuyang Ji, Haohan Wang
TL;DR
This work exposes the fragility of current multimodal LLMs in true visual chart reasoning, showing that removing textual labels causes performance drops of up to roughly 30% on ChartQA because the models rely on OCR shortcuts. It introduces Socratic Chart, a framework that converts chart images into richly structured Scalable Vector Graphics (SVG) using a multi-agent pipeline with specialized agent-generators and an agent-critic to produce high-fidelity symbolic representations. By fusing these SVGs with visual inputs, the approach achieves state-of-the-art or competitive performance on the ChartQA and CharXiv benchmarks, and remains robust under label removal and perturbations. The method offers a principled, interpretable pathway toward robust chart understanding in MLLMs, with potential broad impact on scientific visualization and data interpretation tasks.
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.
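
To make the generator/critic idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a generator that proposes bar primitives, a rule-based critic that checks them against the plot bounds, and a final step that serializes accepted primitives as SVG `<rect>` elements. All names here (`Bar`, `critic`, `run_pipeline`, `to_svg`, `toy_generator`) are hypothetical, and the toy generator stands in for the MLLM-based agents the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, List
import xml.etree.ElementTree as ET


@dataclass
class Bar:
    """Hypothetical bar primitive in pixel units."""
    x: float
    width: float
    height: float  # measured upward from the plot baseline


def critic(bars: List[Bar], plot_width: float, plot_height: float) -> List[str]:
    """Return a list of problems; an empty list means the draft is accepted."""
    problems = []
    for i, b in enumerate(bars):
        if b.height < 0 or b.height > plot_height:
            problems.append(f"bar {i}: height {b.height} outside plot")
        if b.x < 0 or b.x + b.width > plot_width:
            problems.append(f"bar {i}: horizontal extent outside plot")
    return problems


def run_pipeline(generator: Callable[[List[str]], List[Bar]],
                 plot_width: float, plot_height: float,
                 max_rounds: int = 3) -> List[Bar]:
    """Ask the generator for drafts until the critic accepts one."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        bars = generator(feedback)  # generator sees the critic's last feedback
        feedback = critic(bars, plot_width, plot_height)
        if not feedback:
            return bars
    raise ValueError("critic rejected all drafts: " + "; ".join(feedback))


def to_svg(bars: List[Bar], plot_width: float, plot_height: float) -> str:
    """Serialize accepted primitives as SVG (y grows downward in SVG)."""
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width=str(plot_width), height=str(plot_height))
    for b in bars:
        ET.SubElement(svg, "rect", x=str(b.x), y=str(plot_height - b.height),
                      width=str(b.width), height=str(b.height))
    return ET.tostring(svg, encoding="unicode")


def toy_generator(feedback: List[str]) -> List[Bar]:
    """Stand-in for an MLLM agent: the first draft is wrong, the retry is fixed."""
    if not feedback:
        return [Bar(10, 20, 50), Bar(40, 20, 150)]  # second bar is too tall
    return [Bar(10, 20, 50), Bar(40, 20, 100)]


bars = run_pipeline(toy_generator, plot_width=100, plot_height=100)
svg_text = to_svg(bars, plot_width=100, plot_height=100)
```

In this sketch the critic only enforces geometric bounds; a critic in the spirit of the paper would presumably compare the rendered SVG against the original chart image, which is beyond what the abstract specifies.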
