Are LLMs ready for Visualization?

Pere-Pau Vázquez

TL;DR

The goal is to fill the gap with a systematic approach that analyzes whether Large Language Models are capable of correctly generating a large variety of charts, what libraries they can deal with effectively, and how far the authors can go to configure individual charts.

Abstract

Generative models have received a lot of attention in many areas of academia and industry. Their capabilities span many areas, from the invention of images given a prompt to the generation of concrete code to solve a certain programming issue. These two paradigmatic cases sit at opposite ends of a range of requirements, from "creativity" to "precision", as characterized by Bing Chat, which employs ChatGPT-4 as its backbone. Visualization practitioners and researchers have wondered to what extent such systems could accomplish our work in a more efficient way. Several works in the literature have used them to create visualizations, and some tools, such as Lida, incorporate them as part of their pipeline. Nevertheless, to the authors' knowledge, no systematic approach for testing their capabilities that includes both extensive and in-depth evaluation has been published. Our goal is to fill that gap with a systematic approach that analyzes three elements: whether Large Language Models are capable of correctly generating a large variety of charts, what libraries they can deal with effectively, and how far we can go to configure individual charts. To achieve this objective, we first selected a diverse set of charts that are commonly used in data visualization. We then developed a set of generic prompts that could be used to generate them, and analyzed the performance of different LLMs and libraries. The results include both the set of prompts and the data sources, as well as an analysis of the performance with different configurations.

Paper Structure

This paper contains 14 sections, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Generation of a grouped bar chart using ChatGPT3 and ChatGPT4 with the default configuration. In the first case, the system fails because it incorrectly assumes that only two values will be present per category. ChatGPT4, although it also uses matplotlib by default, properly generates as many bars as there are values per category. Nonetheless, the code relies on default values for most of the chart, resulting in unintended outcomes such as a legend that overlaps the data and labels that do not fit within the window.
  • Figure 2: Incorrect generation of a range plot by ChatGPT3. In this case, the chart is displayed, but it does not correspond to a range plot.
  • Figure 3: Incorrect generation of a dot plot by ChatGPT4. The code executes, but the output resembles a scatterplot with numbers.
  • Figure 4: Generation of a bullet chart using the ages' dataset. The prompt specifies column B as the value and column C as the second value. Of the three prompts, only the one using Altair seems to succeed. The default configuration (left) renders two bars with the same size and opacity, making it difficult to tell whether the chart is a stacked bar or shows two similar values. Plotly generates something that does not closely resemble a bullet chart, while Altair (right) uses a tick to mark the reference value.
  • Figure 5: Generation of a pyramid chart using the ages' dataset. Of the three prompts, only the one using the default ChatGPT4 configuration (left) works properly. Plotly generates an awkward result, while Altair (right) displays two side-by-side bar charts.
  • ...and 2 more figures
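
The failure described in Figure 1 (code that assumes exactly two values per category) can be illustrated with a minimal sketch. This is not the code generated in the paper: the function names `grouped_bar_positions` and `plot_grouped_bars` and the `group_width` parameter are hypothetical, chosen only for illustration. The sketch derives bar offsets from the actual number of series, and places the legend outside the axes, one possible way to avoid the overlap mentioned in the caption.

```python
from typing import Dict, List, Sequence


def grouped_bar_positions(n_categories: int, n_series: int,
                          group_width: float = 0.8) -> List[List[float]]:
    """Return the x-positions of the bar centres, one list per series.

    Categories sit at x = 0, 1, 2, ...; the bars of one category share
    `group_width`, so the layout adapts to any number of series instead
    of hard-coding two bars per category.
    """
    if n_categories < 0 or n_series < 1:
        raise ValueError("need n_categories >= 0 and n_series >= 1")
    bar_width = group_width / n_series
    positions = []
    for s in range(n_series):
        # Offset of series s relative to the centre of its category.
        offset = (s - (n_series - 1) / 2) * bar_width
        positions.append([c + offset for c in range(n_categories)])
    return positions


def plot_grouped_bars(categories: Sequence[str],
                      series: Dict[str, Sequence[float]],
                      group_width: float = 0.8):
    """Draw grouped bars; `series` maps a label to one value per category."""
    import matplotlib.pyplot as plt  # imported lazily, only needed to draw

    positions = grouped_bar_positions(len(categories), len(series), group_width)
    bar_width = group_width / len(series)
    fig, ax = plt.subplots(layout="constrained")
    for xs, (label, values) in zip(positions, series.items()):
        ax.bar(xs, values, width=bar_width, label=label)
    ax.set_xticks(range(len(categories)))
    ax.set_xticklabels(categories, rotation=45, ha="right")
    # Put the legend outside the axes so it cannot cover the bars.
    ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1.0))
    return fig
```

Calling `plot_grouped_bars(["A", "B"], {"x": [1, 2], "y": [3, 4], "z": [5, 6]})` would draw three bars per category. The position helper is kept separate from the drawing code so the layout logic can be checked without a plotting backend.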