VisCoder2: Building Multi-Language Visualization Coding Agents

Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen

TL;DR

VisCoder2 addresses the practical shortcomings of visualization code agents by introducing VisCode-Multi-679K, VisPlotBench, and a family of multi-language models. The dataset enables instruction-tuning across twelve languages with execution-validated samples and multi-turn correction, while VisPlotBench provides a robust evaluation framework with an execute-render-score protocol and self-debug. Empirical results show VisCoder2 outperforms open-source baselines and approaches the performance of proprietary models like GPT-4.1, with iterative self-debug delivering notable gains, especially for symbolic languages such as LilyPond, LaTeX, and Asymptote. Together, these resources establish a foundation for reliable, cross-language visualization coding agents capable of real-world iterative refinement.

Abstract

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4.1. Iterative self-debug yields further gains, particularly in symbolic or compiler-dependent languages, and brings the overall execution pass rate to 82.4% at the 32B scale.
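
The multi-round self-debug protocol mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `run_python` and `self_debug`, the `revise` callback (a stand-in for a model call), and the default of three rounds are all assumptions made for illustration. The sketch covers only the execute step; rendering and scoring of the output image are omitted.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def run_python(code: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Run code in a fresh interpreter and return (passed, stderr text)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    return proc.returncode == 0, proc.stderr


def self_debug(
    code: str,
    revise: Callable[[str, str], str],
    max_rounds: int = 3,  # assumed budget, not taken from the paper
) -> Tuple[str, bool, int]:
    """Execute code; on failure, feed the error back to `revise` and retry.

    `revise` is a hypothetical hook that would wrap a model call: it takes
    the failing code and the error message and returns a corrected program.
    Returns (final code, whether it executed, number of revision rounds used).
    """
    passed, error = run_python(code)
    rounds = 0
    while not passed and rounds < max_rounds:
        code = revise(code, error)
        rounds += 1
        passed, error = run_python(code)
    return code, passed, rounds
```

In use, `revise` would prompt the model with the failing code and the captured error. A toy reviser that rewrites a known bad identifier is enough to exercise the loop. An execution pass rate is then the fraction of tasks for which `passed` is true.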

Paper Structure

This paper contains 85 sections, 28 figures, 22 tables.

Figures (28)

  • Figure 1: Overview of VisCoder2. We present three components: 1) VisCode-Multi-679K: a dataset of 679K executable visualization code pairs with multi-round correction dialogues across 12 programming languages; 2) VisPlotBench: a benchmark spanning 8 languages with natural language instructions, executable code, and rendered outputs; 3) VisCoder2: a family of visualization coding agents that iteratively execute, render, and self-debug, approaching the performance of proprietary models.
  • Figure 2: Data construction pipeline for VisCode-Multi-679K. We collect code blocks across twelve programming languages from open-source repositories, including large-scale code corpora, synthetic visualization datasets, and domain-specific diagram collections. We validate executability and render outputs through Jupyter-based runtime checks, yielding instructions paired with images. We integrate multi-turn dialogues from Code-Feedback to provide iterative correction supervision.
  • Figure 3: Overview of VisPlotBench. The benchmark covers eight visualization languages and contains 888 diverse visualization tasks, each combining a natural language instruction and a rendered visual. Tasks are annotated with a Visual category and a Subtype, spanning 13 categories in total.
  • Figure 4: Distribution of fine-grained visualization types in VisPlotBench. Tasks are organized into 13 Visual categories and 116 Subtypes, ensuring broad coverage of both common and underexplored visualization families.
  • Figure 5: Example of a successful generation in Python (ID: 1). The model generates code that executes successfully and produces a plot consistent with the ground truth.
  • ...and 23 more figures