Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations

James Ford, Xingmeng Zhao, Dan Schumacher, Anthony Rios

TL;DR

The results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures, and demonstrate that few-shot prompting significantly boosts the accuracy of chart generation.

Abstract

We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Traditional evaluation methods often rely on human judgment, which is costly and unscalable, or focus solely on data accuracy, neglecting the effectiveness of visual communication. By employing VQA models, we assess data representation quality and the general communicative clarity of charts. Experiments were conducted using two leading VQA benchmark datasets, ChartQA and PlotQA, with visualizations generated by OpenAI's GPT-3.5 Turbo and Meta's Llama 3.1 70B-Instruct models. Our results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures. Moreover, while our results demonstrate that few-shot prompting significantly boosts the accuracy of chart generation, considerable progress remains to be made before LLMs can fully match the precision of human-generated graphs. This underscores the importance of our work, which expedites the research process by enabling rapid iteration without the need for human annotation, thus accelerating advancements in this field.
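The comparison described in the abstract can be illustrated with a minimal sketch: the same question-answer pairs are posed against an original chart and an LLM-generated chart, and the two accuracies are compared. Everything here is an assumption made for illustration, not the authors' implementation: the names `normalize`, `vqa_accuracy`, and `accuracy_gap` are hypothetical, `answer_fn` stands in for whatever VQA model is used, and scoring is simple exact match after normalization, which may differ from the metric used in the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration: a QA item pairs a question with its
# gold answer, and a VQA model is any callable (chart_path, question) -> answer.
QAItem = Tuple[str, str]
AnswerFn = Callable[[str, str], str]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def vqa_accuracy(chart_path: str, qa_items: List[QAItem], answer_fn: AnswerFn) -> float:
    """Fraction of questions answered correctly for one chart (exact match, an assumption)."""
    if not qa_items:
        return 0.0
    correct = sum(
        normalize(answer_fn(chart_path, question)) == normalize(gold)
        for question, gold in qa_items
    )
    return correct / len(qa_items)


def accuracy_gap(
    original_chart: str,
    generated_chart: str,
    qa_items: List[QAItem],
    answer_fn: AnswerFn,
) -> Dict[str, float]:
    """Score both charts on the same questions and report the difference."""
    original_acc = vqa_accuracy(original_chart, qa_items, answer_fn)
    generated_acc = vqa_accuracy(generated_chart, qa_items, answer_fn)
    return {
        "original": original_acc,
        "generated": generated_acc,
        "gap": original_acc - generated_acc,
    }
```

A positive `gap` would indicate that the VQA model answers more questions correctly on the original chart than on the generated one, which is the direction of the result the abstract reports. Because no human annotation is involved, the same loop can be rerun after each change to the generation prompt.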

Paper Structure (7 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: An overview of the VQA evaluation process for generated visualizations. The visual LLMs (vLLMs) represent trained models for chart QA.
  • Figure 2: Overall framework for our study, where we perform automatic chart generation, benchmarking using chart question answering, manually analyze errors, and perform a survey on chart quality.
  • Figure 3: Chart Visualization Errors: Category Ambiguous
  • Figure 4: Chart Visualization Errors: Colors Not Matching
  • Figure 5: Chart Visualization Errors: Dates Errors
  • ...and 2 more figures