Vi(E)va LLM! A Conceptual Stack for Evaluating and Interpreting Generative AI-based Visualizations

Luca Podo, Muhammad Ishmal, Marco Angelini

TL;DR

The paper addresses the challenge of evaluating LLM-generated visualizations by introducing EvaLLM, a hierarchical evaluation stack with five layers (Code, Representation, Presentation, Informativeness, and LLM) that enables fine-grained automatic and human assessments. It pairs the stack with an online platform that supports multi-assessor labeling and automatic scoring across levels, demonstrated through two case studies using GPT-3.5-Turbo Code Interpreter and Llama2-70b on an nvBench subset. The results highlight strengths and typical failure modes in data mapping, axis handling, and presentation, and show EvaLLM’s potential as a standardized benchmarking framework for LLM-based visualization tasks. The work lays the groundwork for systematic comparative benchmarking, while acknowledging limitations such as dataset breadth and the need for an expanded taxonomy of errors, with future plans for broader datasets and model coverage. The practical impact lies in providing a rigorous, extensible methodology and toolchain to benchmark and interpret LLM-driven visualizations, supporting researchers and practitioners in improving the reliability and interpretability of AI-generated visuals.

Abstract

The automatic generation of visualizations is an old task that, through the years, has attracted growing interest from the research and practitioner communities. Recently, large language models (LLMs) have become an interesting option for supporting generative tasks related to visualization, demonstrating initial promising results. At the same time, several pitfalls, like the multiple ways of instructing an LLM to generate the desired result, the different perspectives leading the generation (code-based, image-based, grammar-based), and the presence of hallucinations even for the visualization generation task, make their usage less affordable than expected. Following similar initiatives for benchmarking LLMs, this paper tackles the problem of modeling the evaluation of a visualization generated by an LLM. We propose a theoretical evaluation stack, EvaLLM, that decomposes the evaluation effort into its atomic components, characterizes their nature, and provides an overview of how to implement and interpret them. We also designed and implemented an evaluation platform that provides a benchmarking resource for the visualization generation task. The platform supports automatic and manual scoring conducted by multiple assessors to support a fine-grained and semantic evaluation based on the EvaLLM stack. Two case studies on the GPT-3.5-Turbo with Code Interpreter and Llama2-70b models show the benefits of EvaLLM and illustrate interesting results on the current state-of-the-art LLM-generated visualizations.
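
As a rough illustration of how per-level scores from several assessors could be combined, the sketch below averages scores within each of the five levels named in the TL;DR. It is not the platform's implementation: the `Score` record, the 0-to-1 scale, the assessor identifiers, and the use of a plain mean are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean

# Level names as listed in the TL;DR; their ordering here is an assumption.
LAYERS = ("Code", "Representation", "Presentation", "Informativeness", "LLM")


@dataclass(frozen=True)
class Score:
    """One judgement of one generated visualization (hypothetical record)."""
    layer: str      # one of LAYERS
    assessor: str   # hypothetical id: a human assessor or an automatic checker
    value: float    # hypothetical scale from 0.0 (wrong) to 1.0 (correct)


def aggregate(scores):
    """Average the scores per layer; layers without any score map to None."""
    by_layer = {layer: [] for layer in LAYERS}
    for s in scores:
        if s.layer not in by_layer:
            raise ValueError(f"unknown layer: {s.layer!r}")
        by_layer[s.layer].append(s.value)
    return {layer: (mean(vals) if vals else None)
            for layer, vals in by_layer.items()}


example = [
    Score("Code", "auto", 1.0),
    Score("Presentation", "assessor_1", 0.5),
    Score("Presentation", "assessor_2", 1.0),
]
summary = aggregate(example)
```

Keeping unscored layers as `None` rather than zero separates "not yet assessed" from "assessed and failed", which matters when automatic and manual scoring cover different levels.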

Paper Structure

This paper contains 31 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: EvaLLMStack: concept evaluation stack for evaluating the LLM-generated visualizations
  • Figure 2: The image depicts the LLM's potential involvement at each level, with colors indicating impact in the three scenarios discussed. Red signifies suboptimal integration, yellow prompts further evaluation, and green denotes positive impacts.
  • Figure 3: EvaLLMStack: concept evaluation stack for evaluating the LLM-generated visualizations
  • Figure 4: GPT-3.5-Turbo performance on 50 nvBench samples along mark and axes fields accuracy.
  • Figure 5: Examples of wrong generation by GPT-3.5 split by EvaLLM levels. Where the levels are purely numerical, the example is not reported.
  • ...and 3 more figures