Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models

Hyung-Kwon Ko; Hyeon Jeon; Gwanmo Park; Dae Hyun Kim; Nam Wook Kim; Juho Kim; Jinwook Seo

TL;DR

VL2NL presents a scalable framework that generates diverse NL datasets for data visualization by transforming Vega-Lite specifications through guided discovery prompting and score-based paraphrasing. The authors introduce a large real-world Vega-Lite collection (1,981 specs) and demonstrate accurate extraction of chart semantics (L1/L2 captions) and rich, diverse NL utterances and questions. Empirical results show high semantic accuracy and markedly improved NL diversity, with finetuning experiments indicating performance gains when augmenting benchmarks with VL2NL-generated data. The work advances NLIs for data visualization by enabling fully automatic or mixed-initiative NL dataset generation, with practical implications for building more natural, scalable visualization interfaces.

Abstract

We introduce VL2NL, a Large Language Model (LLM) framework that generates rich and diverse NL datasets using only Vega-Lite specifications as input, thereby streamlining the development of Natural Language Interfaces (NLIs) for data visualization. To synthesize relevant chart semantics accurately and enhance syntactic diversity in each NL dataset, we leverage 1) guided discovery incorporated into prompting, so that LLMs can steer themselves to create faithful NL datasets in a self-directed manner; and 2) score-based paraphrasing to augment NL syntax along four language axes. We also present a new collection of 1,981 real-world Vega-Lite specifications that are more diverse and complex than existing chart collections. When tested on our chart collection, VL2NL extracted chart semantics and generated L1/L2 captions with 89.4% and 76.0% accuracy, respectively. It also generated and paraphrased utterances and questions with greater diversity than the benchmarks. Last, we discuss how our NL datasets and framework can be utilized in real-world scenarios. The code and chart collection are available at https://github.com/hyungkwonko/chart-llm.

Paper Structure (61 sections, 5 figures, 9 tables)

Introduction
Background and Related Work
Chart Datasets
NLIs for Data Visualization
LLMs and NL Datasets
Vega-Lite Dataset
Dataset Construction
Search Queries.
Inclusion and Exclusion Criteria.
Post-processing.
Quantitative Analysis
Benchmarks.
Quality Metrics.
Complexity Levels.
Composite View, Interactivity, and Chart Type Distribution.
...and 46 more sections

Figures (5)

Figure 1: Example of a Vega-Lite specification. As previously noted in several works (zhao2020chartseer; luo2021synthesizing), a Vega-Lite specification can be regarded as a tree, with its keys (i.e., properties) nested within one another.
Figure 2: Vega-Lite dataset divided by complexity level: simple, medium, complex, and extra complex. These 48 charts were selected via stratified sampling and used in our evaluation (Section 5). Levels are assigned based on the number of keys each specification contains, with the thresholds set at the quartiles (Q1, Q2, Q3) of the Vega-Lite example gallery dataset (vegalitegallery).
Figure 3: LLM Framework to Generate NL Datasets for Visualizations. We start by (b) preprocessing underlying datasets and minifying Vega-Lite specifications. Subsequently, (c) we employ scaffolding and key questions, (e) to generate NL datasets like L1/L2 captions, utterances, and questions. (d) This is followed by score-based paraphrasing, (f) allowing us to produce syntactically paraphrased NL datasets.
Figure 4: A system with two modes (fully-automatic and mixed-initiative) to generate NL datasets using VL2NL. The mixed-initiative mode encompasses several features. First, users can select the types of NL datasets they want to generate (C). They can inspect each chart (A) and subsequently choose the specific ones they wish to use for generating NL datasets (E). Users can change or provide information that the system utilizes (B). Once these are completed, the system returns the generated NL datasets (D). In contrast, the fully-automatic mode does not include (B). As a result, dataset generation in this mode strictly follows the scaffolding defined by researchers, along with key questions and answers generated by LLMs.
Figure 5: Four examples of generated L1/L2 captions with corresponding charts. We found that VL2NL can successfully generate captions even on complex charts with varying interactions and multiple views.
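
The captions for Figures 1 and 2 describe a Vega-Lite specification as a tree of nested keys and assign complexity levels by key count against quartile thresholds. The following is a minimal Python sketch of that idea, not the authors' implementation. The function names (`count_keys`, `quartile_thresholds`, `complexity_level`), the choice to count keys inside arrays, and the handling of counts that fall exactly on a threshold are all assumptions made for illustration.

```python
import statistics


def count_keys(node):
    """Count every key (property) in a nested spec, walking it as a tree.

    Assumption: keys of objects nested inside lists are counted too.
    """
    if isinstance(node, dict):
        return sum(1 + count_keys(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_keys(item) for item in node)
    return 0  # scalars (strings, numbers, booleans) contribute no keys


def quartile_thresholds(key_counts):
    """Return (Q1, Q2, Q3) of a reference collection's key counts."""
    q1, q2, q3 = statistics.quantiles(key_counts, n=4)
    return q1, q2, q3


def complexity_level(n_keys, q1, q2, q3):
    """Map a key count to one of the four levels named in Figure 2.

    Assumption: a count equal to a threshold falls into the lower level.
    """
    if n_keys <= q1:
        return "simple"
    if n_keys <= q2:
        return "medium"
    if n_keys <= q3:
        return "complex"
    return "extra complex"


# A small hypothetical bar-chart specification, used only as an example.
spec = {
    "data": {"url": "data/cars.json"},
    "mark": "bar",
    "encoding": {
        "x": {"field": "Origin", "type": "nominal"},
        "y": {"aggregate": "count", "type": "quantitative"},
    },
}

n_keys = count_keys(spec)
```

In use, `quartile_thresholds` would be called once on the key counts of the reference collection (the Vega-Lite example gallery, per the Figure 2 caption), and `complexity_level` would then be applied to each specification in the new collection.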
