VisEval: A Benchmark for Data Visualization in the Era of Large Language Models

Nan Chen, Yuge Zhang, Jiahang Xu, Kan Ren, Yuqing Yang

TL;DR

VisEval addresses the lack of a reliable NL2VIS benchmark with a large, high-quality dataset (2,524 NL–VIS pairs across 146 databases) and an automated evaluation framework covering three dimensions: validity, legality, and readability. The dataset is curated with a combination of rule-based, LLM-based, and human-assisted steps. The framework checks legality by deconstructing the chart's SVG and assesses readability with a GPT-4V-based evaluator, which enables end-to-end benchmarking of state-of-the-art LLMs on visualization generation. The study analyzes model performance, common error modes, and the impact of data formatting and table disruption, and draws out insights for future NL2VIS improvements. Overall, VisEval supports reproducible, scalable assessment of NL2VIS systems for researchers and practitioners who want to measure and improve how well LLMs generate valid, legal, and readable visualizations from natural language.

Abstract

Translating natural language to visualization (NL2VIS) has shown great promise for visual data analysis, but it remains a challenging task that requires multiple low-level implementations, such as natural language processing and visualization design. Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language. However, the lack of a comprehensive and reliable benchmark hinders our understanding of LLMs' capabilities in visualization generation. In this paper, we address this gap by proposing a new NL2VIS benchmark called VisEval. Firstly, we introduce a high-quality and large-scale dataset. This dataset includes 2,524 representative queries covering 146 databases, paired with accurately labeled ground truths. Secondly, we advocate for a comprehensive automated evaluation methodology covering multiple dimensions, including validity, legality, and readability. By systematically scanning for potential issues with a number of heterogeneous checkers, VisEval provides reliable and trustworthy evaluation outcomes. We run VisEval on a series of state-of-the-art LLMs. Our evaluation reveals prevalent challenges and delivers essential insights for future advancements.

Paper Structure

This paper contains 47 sections, 21 figures, and 12 tables.

Figures (21)

  • Figure 1: Example cases where previous methods fall short: (a) color channels are not considered, so identical colors used for different categories go unnoticed; and (b) exact matching causes a misjudgment, where the ground truth maps the "rank" data field to the x-axis and the "sex" data field to the color channel. Since the query did not explicitly specify which data field should be mapped to which channel, this case should also be considered appropriate.
  • Figure 2: Example of (NL, VIS) pairs. Two NL queries correspond to the same VIS. Note that the ground truth VIS represents a feasible region for multiple acceptable visualization instances.
  • Figure 3: Statistical analysis of the dataset: (a) A histogram of the number of visualizations per database, and (b) the distribution of visualizations across different chart types and hardness.
  • Figure 4: The pipeline of VisEval includes three key modules: the validity checker, the legality checker, and the readability evaluator.
  • Figure 5: An example of using the readability evaluator. The layout check identified issues with the overflow of ticks and the title on the x-axis. The scale & ticks check revealed that the y-axis ticks were displayed using floating-point numbers, which is unconventional for representing integer values like the count of wins. These evaluations were given to the readability evaluator, resulting in a final overall score of 2 along with a concise rationale.
  • ...and 16 more figures
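
The three modules named in Figure 4, together with the issues highlighted in Figures 1(a) and 5, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not VisEval's implementation: the `Chart` record, the three `check_*` functions, and the stage-by-stage gating in `evaluate` are all hypothetical. Placing the shared-color check in the legality stage is likewise an assumption, and the readability stage here is a single rule-based tick check rather than the GPT-4V-based evaluator the paper describes.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Chart:
    """Hypothetical, simplified description of one generated chart."""
    chart_type: str
    category_colors: dict[str, str]  # category label -> fill color
    y_values: list[float]            # plotted data values
    y_tick_labels: list[str]         # numeric tick labels as rendered
    rendered: bool = True            # False if the generated code failed


def check_validity(chart: Chart) -> list[str]:
    """Stage 1 (assumed): did the generated code produce a chart at all?"""
    return [] if chart.rendered else ["generated code did not produce a chart"]


def check_legality(chart: Chart, expected_type: str) -> list[str]:
    """Stage 2 (assumed): does the chart match what the query asked for?"""
    issues = []
    if chart.chart_type != expected_type:
        issues.append(f"expected a {expected_type} chart, got {chart.chart_type}")
    # Figure 1(a): distinct categories should not be drawn in the same color.
    colors = list(chart.category_colors.values())
    if len(set(colors)) < len(colors):
        issues.append("different categories share the same color")
    return issues


def check_readability(chart: Chart) -> list[str]:
    """Stage 3 (assumed): one rule inspired by the scale & ticks check in Figure 5."""
    data_is_integer = all(float(v).is_integer() for v in chart.y_values)
    ticks_fractional = any(not float(t).is_integer() for t in chart.y_tick_labels)
    if data_is_integer and ticks_fractional:
        return ["integer data is shown with fractional y-axis ticks"]
    return []


def evaluate(chart: Chart, expected_type: str) -> dict:
    """Run the stages in order and stop at the first one that reports issues."""
    stages = [
        ("validity", lambda: check_validity(chart)),
        ("legality", lambda: check_legality(chart, expected_type)),
        ("readability", lambda: check_readability(chart)),
    ]
    for name, run in stages:
        issues = run()
        if issues:
            return {"passed": False, "failed_stage": name, "issues": issues}
    return {"passed": True, "failed_stage": None, "issues": []}


if __name__ == "__main__":
    # Two categories drawn in the same color, as in Figure 1(a).
    chart = Chart("bar", {"F": "#1f77b4", "M": "#1f77b4"}, [3, 5], ["0", "2", "4"])
    print(evaluate(chart, "bar"))
```

Stopping at the first failing stage keeps later checks from running on a chart that is already known to be unusable. A real evaluator would inspect the rendered SVG rather than a hand-built record.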