PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization

Jiajun Zhang, Jianke Zhang, Zeyu Cui, Jiaxi Yang, Lei Zhang, Binyuan Hui, Qiang Liu, Zilei Wang, Liang Wang, Junyang Lin

TL;DR

PlotCraft addresses a critical gap in evaluating LLMs for complex data visualization by introducing a large-scale, multi-turn benchmark (PlotCraft) and a high-quality synthetic dataset (SynthVis-30K). The authors build PlotCraftor, an open-weight model trained via supervised fine-tuning on SynthVis-30K, which achieves state-of-the-art performance among open models and approaches proprietary baselines on hard visualization tasks. The benchmark includes 982 tasks (approximately 1k) spanning 48 chart types, 31 thematic topics, and three difficulty levels, with a sandboxed evaluation pipeline that combines automated judging with an analysis of its correlation with human judgments. Together, PlotCraft, SynthVis-30K, and PlotCraftor support scalable assessment and development of LLMs for complex data visualization, with implications for practical visualization tooling and AI-assisted data analysis.

Abstract

Recent Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation. However, their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped. To address this gap, we introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks that cover a wide range of topics, such as finance, scientific research, and sociology. The benchmark is structured around seven high-level visualization tasks and encompasses 48 distinct chart types. Crucially, it is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities. Our comprehensive evaluation of 23 leading LLMs on PlotCraft reveals clear performance deficiencies in handling sophisticated visualization tasks. To bridge this performance gap, we develop SynthVis-30K, a large-scale, high-quality dataset of complex visualization code synthesized via a collaborative agent framework. Building upon this dataset, we develop PlotCraftor, a novel code generation model that achieves strong capabilities in complex data visualization with a remarkably small size. Across VisEval, PandasPlotBench, and our proposed PlotCraft, PlotCraftor shows performance comparable to that of leading proprietary approaches. In particular, on hard tasks, our model achieves a performance improvement of over 50%. We will release the benchmark, dataset, and code at https://github.com/Speakn0w/PlotCraft-Benchmark.

Paper Structure

This paper contains 83 sections, 32 figures, 9 tables.

Figures (32)

  • Figure 1: An overview of the PlotCraft benchmark and the performance of several leading LLMs and our model, PlotCraftor. (Left) A polar bar chart compares the performance of PlotCraftor against five leading baseline models across all of our proposed sub-metrics. The purple area explicitly highlights the performance gains of PlotCraftor relative to its base model, Qwen3-Coder-30B-A3B. (Right) An example task from PlotCraft, which requires an LLM to process raw Kaggle data and a complex, human-written instruction to generate visualization code, which is then rendered into the final chart. The PlotCraft benchmark comprises 1k high-quality evaluation instances.
  • Figure 2: A real-world example illustrating the limitations of LLMs on complex visualization tasks. When presented with a sophisticated request, the model generates a low-quality output and struggles to make effective improvements during the subsequent refinement process.
  • Figure 3: An overview of the SynthVis-30K dataset, detailing its coverage and the multi-agent framework used for its creation. (Left) A hierarchical chart illustrates the dataset's comprehensive coverage across three dimensions: thematic Topics, chart Types, and visualization Tasks. (Right) A schematic of our multi-agent data synthesis pipeline. This framework consists of two primary stages, Task Generation and Code Generation, which process raw Kaggle data to produce complete, multi-modal visualization instances. Each instance comprises structured data, a natural language instruction, the visualization code, and the corresponding rendered image.
  • Figure 4: Performance comparison of PlotCraftor and five leading LLMs on tasks of varying difficulty. The figure is split into two subplots: (a) Single-Turn Generation and (b) Multi-Turn Refinement. Within each subplot, we report the Pass Rate (%) and Average Score for tasks categorized as Easy, Medium, and Hard. The yellow hatched area within the PlotCraftor bars indicates the score contribution from its base model, Qwen3-Coder-30B-A3B.
  • Figure 5: Performance scaling on Easy vs. Hard.
  • ...and 27 more figures
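
Figure 4 reports a Pass Rate (%) and an Average Score for each difficulty level (Easy, Medium, Hard). As a rough illustration of how such per-difficulty aggregates could be computed from per-task results, here is a minimal sketch. The record fields (`difficulty`, `passed`, `score`), the score scale, and the choice to count failed tasks as a score of 0 are all assumptions made for this example; the paper's actual evaluation pipeline may define these quantities differently.

```python
from collections import defaultdict


def summarize_by_difficulty(results):
    """Aggregate per-task results into pass rate (%) and average score per difficulty.

    Each result is a dict with hypothetical keys:
      "difficulty": str   (e.g. "Easy", "Medium", "Hard")
      "passed":     bool  (assumed: generated code ran and produced a chart)
      "score":      float (assumed: judge score for the rendered chart)
    """
    buckets = defaultdict(list)
    for record in results:
        buckets[record["difficulty"]].append(record)

    summary = {}
    for level, items in buckets.items():
        n = len(items)
        passed = sum(1 for r in items if r["passed"])
        # Assumption: a task that did not pass contributes 0 to the average.
        total = sum(r["score"] if r["passed"] else 0.0 for r in items)
        summary[level] = {
            "pass_rate": 100.0 * passed / n,
            "avg_score": total / n,
        }
    return summary


# Toy records, invented purely for illustration (not results from the paper).
results = [
    {"difficulty": "Easy", "passed": True, "score": 8.0},
    {"difficulty": "Easy", "passed": True, "score": 6.0},
    {"difficulty": "Hard", "passed": True, "score": 4.0},
    {"difficulty": "Hard", "passed": False, "score": 0.0},
]
summary = summarize_by_difficulty(results)
for level, stats in summary.items():
    print(level, stats)
```

Averaging over all tasks rather than only passing ones keeps the two metrics comparable across models with different pass rates; that is a design choice of this sketch, not a statement about the benchmark.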