RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu, Leqi Zheng, Yiran Yang, Jianke Zhang, Qingbin Li, Shannan Yan, Zhetong Li, Changguo Jia, Junfei Wu, Zilei Wang, Qiang Liu, Liang Wang

Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce \textbf{\texttt{RealChart2Code}}, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on \texttt{RealChart2Code} reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at \url{https://github.com/Speakn0w/RealChart2Code}.

Paper Structure

This paper contains 64 sections, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: A real-world example illustrating the limitations of LLMs on complex chart-to-code tasks. When presented with a sophisticated request, the model generates a low-quality output and struggles to make effective improvements during the subsequent refinement process.
  • Figure 2: The three core tasks of the RealChart2Code benchmark. Chart Replication is the fundamental chart-to-code task. Chart Reproduction adds the challenge of using provided raw data files. Chart Refinement introduces a conversational component, where the model must debug and modify code to fix errors according to user feedback.
  • Figure 3: Dataset distribution across tasks. (a) Image distribution. (b) CSV data distribution. (c) Data length distribution with median (M) and mean ($\mu$) in tokens.
  • Figure 4: Performance breakdown across tasks and metrics for top models. Each radial chart shows scores for eight visual accuracy metrics (Type: chart type consistency, Layout: spatial layout, Text: text elements, Axis: axis configuration, Color: color scheme, Style: style and format, Complete: component completeness, Data: data alignment), execution pass rate (Pass), and three quality metrics (Clarity: visual clarity, Balance: compositional balance, Typo: typographic quality).
  • Figure 5: Score distributions with 95% confidence intervals across evaluation metrics for Claude-4.5-Sonnet and Gemini-3.0-Pro-Preview on chart replication, reproduction, and refinement tasks.
  • ...and 6 more figures