Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation
Minggui He, Mingchen Dai, Jian Zhang, Yilun Liu, Shimin Tao, Pufan Zeng, Osamu Yoshie, Yuya Ieiri
TL;DR
This work introduces Chart Specification, a structured intermediate representation that encodes chart topology, data bindings, and runtime-derived numerics to align chart understanding with executable plotting code. By constructing ChartStruct, a structurally balanced training corpus, and applying a Spec-Align Reward within a reinforcement learning framework, the approach provides dense, verifiable feedback that steers models toward faithful plotting logic. On the ChartMimic, Plot2Code, and ChartX benchmarks, the method achieves state-of-the-art results with only 3K–4K training samples and generalizes robustly to complex chart types. Together, structure-aware supervision, runtime data grounding, and a hierarchical reward mechanism offer a scalable path to high-fidelity chart-to-code generation, with practical value for reproducibility and automated chart tooling.
Abstract
Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, which encourages surface-level token imitation rather than faithful modeling of the underlying chart structure and often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. The method is data-efficient: with only 3K training samples it surpasses leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: https://github.com/Mighten/chart-specification-paper
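To make the idea of a structured intermediate representation with verifiable, hierarchical feedback more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' Chart Specification schema or their Spec-Align Reward: the class names (Series, ChartSpec), the function names (numeric_match, toy_reward), the 5% tolerance, and the 0.25/0.25/0.5 weights are all hypothetical choices made for this example. It assumes a spec has already been extracted by executing the generated plotting code, with None standing for code that failed to run.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Series:
    """One plotted series (hypothetical schema, not the paper's)."""
    mark: str      # e.g. "bar", "line", "scatter"
    x: Tuple       # category labels or numeric positions
    y: Tuple       # values read back at runtime


@dataclass(frozen=True)
class ChartSpec:
    """Toy spec: subplot grid plus the series drawn on it."""
    rows: int
    cols: int
    series: Tuple[Series, ...]


def numeric_match(pred, ref, rel_tol=0.05):
    """Fraction of positions where pred agrees with ref.

    Numbers are compared within a relative tolerance (assumed 5%),
    everything else by equality.
    """
    if not ref:
        return 1.0 if not pred else 0.0
    hits = 0
    for p, r in zip(pred, ref):
        if isinstance(p, (int, float)) and isinstance(r, (int, float)):
            hits += math.isclose(p, r, rel_tol=rel_tol, abs_tol=1e-9)
        else:
            hits += (p == r)
    return hits / max(len(pred), len(ref))


def toy_reward(pred: Optional[ChartSpec], ref: ChartSpec) -> float:
    """Hierarchical toy reward in [0, 1] with illustrative weights."""
    if pred is None:                       # generated code did not execute
        return 0.0
    # Structure level: subplot grid and multiset of mark types.
    layout = 1.0 if (pred.rows, pred.cols) == (ref.rows, ref.cols) else 0.0
    pm = Counter(s.mark for s in pred.series)
    rm = Counter(s.mark for s in ref.series)
    n = max(len(pred.series), len(ref.series))
    marks = 1.0 if n == 0 else sum((pm & rm).values()) / n
    structure = 0.5 * layout + 0.5 * marks
    # Data level: compare series pairwise by index, only if marks agree.
    data = 0.0
    for p, r in zip(pred.series, ref.series):
        if p.mark == r.mark:
            data += 0.5 * numeric_match(p.x, r.x) + 0.5 * numeric_match(p.y, r.y)
    data = 1.0 if n == 0 else data / n
    # Data credit is gated by structure, so wrong structure caps the reward.
    return 0.25 + 0.25 * structure + 0.5 * structure * data


ref = ChartSpec(1, 1, (Series("bar", ("A", "B", "C"), (1.0, 2.0, 3.0)),))
wrong_value = ChartSpec(1, 1, (Series("bar", ("A", "B", "C"), (1.0, 2.0, 30.0)),))
wrong_mark = ChartSpec(1, 1, (Series("line", ("A", "B", "C"), (1.0, 2.0, 3.0)),))

for name, spec in [("exact", ref), ("wrong_value", wrong_value),
                   ("wrong_mark", wrong_mark), ("failed", None)]:
    print(name, toy_reward(spec, ref))
```

In this sketch a prediction with the correct chart type but one wrong value scores higher than one with the wrong chart type, and both score above non-executing code. That ordering is the kind of graded signal a structure-level reward can give and plain token-level imitation cannot.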
