Are LLMs ready to help non-expert users to make charts of official statistics data?

Gadir Suleymanli; Alexander Rogiers; Lucas Lageweg; Jefrey Lijffijt

TL;DR

This paper investigates whether current large language models can help non-experts identify relevant official statistics data and automatically generate accurate charts from natural language queries. It introduces an agentic, tool-enabled architecture that iteratively retrieves data, generates code, and refines visualizations, backed by a structured evaluation framework covering data retrieval, code quality, and visual representation. Experiments across eight LLMs and 25 tasks using CBS data reveal data retrieval and manipulation as the main bottleneck, but show that agentic prompts with self-correction markedly improve end-to-end visualization quality, with Claude 3.7 achieving near-perfect scores when combined with contextual design guidance. The work provides a reusable benchmark and design patterns for text-to-vis applications on official statistics, with implications for democratizing access to reliable data and informing data literacy efforts.

Abstract

At a time when biased information, deep fakes, and propaganda proliferate, the accessibility of reliable data sources is more important than ever. National statistical institutes provide curated data that contain quantitative information on a wide range of topics. However, that information is typically spread across many tables, and the plain numbers may be arduous to process. Hence, these open data may be practically inaccessible. We ask the question "Are current Generative AI models capable of facilitating the identification of the right data and the fully automatic creation of charts to provide information in visual form, corresponding to user queries?" We present a structured evaluation of the capabilities of recent large language models (LLMs) to generate charts from complex data in response to user queries. Working with diverse public data from Statistics Netherlands, we assessed multiple LLMs on their ability to identify relevant data tables, perform necessary manipulations, and generate appropriate visualizations autonomously. We propose a new evaluation framework spanning three dimensions: data retrieval & pre-processing, code quality, and visual representation. Results indicate that locating and processing the correct data is the most significant challenge. Additionally, LLMs rarely implement visualization best practices without explicit guidance. When supplemented with information about effective chart design, models showed marked improvement in representation scores. Furthermore, an agentic approach with iterative self-evaluation led to excellent performance across all evaluation dimensions. These findings suggest that LLMs' effectiveness for automated chart generation can be enhanced through appropriate scaffolding and feedback mechanisms, and that systems can already reach the necessary accuracy across the three evaluation dimensions.
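To make the idea of an agentic approach with iterative self-evaluation concrete, the following is a minimal Python sketch of a generate, evaluate, and retry loop scored on the three dimensions named above. It is not the authors' implementation: the `generate` and `evaluate` callables, the `Attempt` record, the 0-to-1 score scale, the `threshold`, and `max_rounds` are all hypothetical choices made for illustration, and the two `fake_*` functions are stand-ins for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable

# The three evaluation dimensions named in the abstract (short labels are ours).
DIMENSIONS = ("data", "code", "visual")


@dataclass
class Attempt:
    code: str                 # candidate chart-generating code
    scores: dict[str, float]  # assumed scale: one score in [0, 1] per dimension
    feedback: str             # critique handed to the next round


def refine(
    query: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], tuple[dict[str, float], str]],
    threshold: float = 0.9,   # assumed pass mark, not taken from the paper
    max_rounds: int = 3,      # assumed retry budget, not taken from the paper
) -> Attempt:
    """Generate, self-evaluate, and retry until every dimension passes."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    best = None
    for _ in range(max_rounds):
        code = generate(query, feedback)
        scores, feedback = evaluate(query, code)
        attempt = Attempt(code, scores, feedback)
        # Keep the attempt whose weakest dimension is strongest.
        if best is None or min(scores.values()) > min(best.scores.values()):
            best = attempt
        if all(scores[d] >= threshold for d in DIMENSIONS):
            break
    return best


# Hypothetical stand-ins for LLM calls, used only to exercise the loop.
def fake_generate(query: str, feedback: str) -> str:
    return "plot(v2)" if feedback else "plot(v1)"


def fake_evaluate(query: str, code: str) -> tuple[dict[str, float], str]:
    if code == "plot(v2)":
        return {"data": 1.0, "code": 1.0, "visual": 0.95}, ""
    return {"data": 0.5, "code": 1.0, "visual": 0.6}, "wrong table selected"


result = refine("unemployment by year", fake_generate, fake_evaluate)
```

In this toy run the first attempt fails on the data dimension, the critique is fed back into the second call, and the loop stops once all three scores clear the threshold. Tracking the weakest dimension reflects the reported finding that data retrieval and processing is the main bottleneck, though the actual scoring scheme may differ.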
