Are LLMs ready to help non-expert users to make charts of official statistics data?

Gadir Suleymanli, Alexander Rogiers, Lucas Lageweg, Jefrey Lijffijt

TL;DR

This paper investigates whether current large language models can assist non-experts in identifying relevant official statistics data and automatically generating accurate charts from natural language queries. It introduces an agentic, tool-enabled architecture that iteratively retrieves data, generates code, and refines visualizations, backed by a structured evaluation framework across data retrieval, code quality, and visual representation. Experiments across eight LLMs and 25 tasks using CBS data reveal data retrieval/manipulation as the main bottleneck, but show that agentic prompts with self-correction markedly improve end-to-end visualization quality, with Claude 3.7 achieving near-perfect scores when combined with contextual design guidance. The work provides a reusable benchmark and design patterns for text-to-vis applications on official statistics, with implications for democratizing access to reliable data and informing data literacy efforts.

Abstract

At a time when biased information, deep fakes, and propaganda proliferate, the accessibility of reliable data sources is more important than ever. National statistical institutes provide curated data that contain quantitative information on a wide range of topics. However, that information is typically spread across many tables and the plain numbers may be arduous to process. Hence, these open data may be practically inaccessible. We ask the question "Are current Generative AI models capable of facilitating the identification of the right data and the fully automatic creation of charts to provide information in visual form, corresponding to user queries?" We present a structured evaluation of recent large language models' (LLMs) capabilities to generate charts from complex data in response to user queries. Working with diverse public data from Statistics Netherlands, we assessed multiple LLMs on their ability to identify relevant data tables, perform necessary manipulations, and generate appropriate visualizations autonomously. We propose a new evaluation framework spanning three dimensions: data retrieval & pre-processing, code quality, and visual representation. Results indicate that locating and processing the correct data is the most significant challenge. Additionally, LLMs rarely implement visualization best practices without explicit guidance. When supplemented with information about effective chart design, models showed marked improvement in representation scores. Furthermore, an agentic approach with iterative self-evaluation led to excellent performance across all evaluation dimensions. These findings suggest that LLMs' effectiveness for automated chart generation can be enhanced through appropriate scaffolding and feedback mechanisms, and that systems can already reach the necessary accuracy across the three evaluation dimensions.
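
The agentic approach with iterative self-evaluation described in the abstract can be pictured roughly as a generate, evaluate, refine loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `refine_chart`, `Attempt`, `toy_generate`, `toy_evaluate`, the score range, and the stopping threshold are all hypothetical names and values introduced for illustration, and the toy functions stand in for real LLM and tool calls. The default of 7 iterations only echoes the iteration count mentioned in the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    code: str       # candidate plotting code
    score: float    # self-evaluation score, assumed to lie in [0, 1]
    feedback: str   # critique passed to the next generation step


def refine_chart(
    query: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Tuple[float, str]],
    max_iterations: int = 7,      # assumption: arbitrary cap
    target_score: float = 0.9,    # assumption: arbitrary threshold
) -> List[Attempt]:
    """Generate chart code, self-evaluate it, and retry with the critique."""
    history: List[Attempt] = []
    feedback = ""
    for _ in range(max_iterations):
        code = generate(query, feedback)
        score, feedback = evaluate(query, code)
        history.append(Attempt(code, score, feedback))
        if score >= target_score:
            break  # good enough, stop refining
    return history


# Hypothetical stand-ins for the LLM calls, used only to exercise the loop.
def toy_generate(query: str, feedback: str) -> str:
    suffix = "  # with axis labels" if "labels" in feedback else ""
    return "plot(data)" + suffix


def toy_evaluate(query: str, code: str) -> Tuple[float, str]:
    if "axis labels" in code:
        return 1.0, ""
    return 0.5, "add axis labels"


history = refine_chart(
    "Plot and compare foreign and domestic turnover",
    toy_generate,
    toy_evaluate,
)
for step, attempt in enumerate(history, start=1):
    print(step, attempt.score, attempt.feedback or "accepted")
```

In this toy run the first attempt is critiqued for missing axis labels and the second attempt is accepted. A real system would replace the two toy functions with model calls and code execution, and could score each of the three evaluation dimensions separately.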

Paper Structure

This paper contains 24 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Sample visualization generated by Claude 3.7 using a modular prompt and a self-feedback loop (7 iterations), responding to the prompt: "Plot and compare foreign and domestic turnover for All Sectors." The chart compares domestic and foreign turnover indices from 2005 to 2023, averaged across all industry sectors, based on data from Statistics Netherlands. Raw monthly trends are shown with thin lines, while 12-month moving averages are overlaid with thicker curves to highlight longer-term patterns. Two key economic disruptions (the 2008–2009 Financial Crisis and the COVID-19 Pandemic) are marked with shaded regions. The visualization showcases use of reasoning, clear axis labeling, and appropriate use of smoothing and context. Sample figures are displayed as produced by the LLM system, without any post-processing. More results can be found at https://github.com/aida-ugent/llm_visualization_results
  • Figure 2: The zero-shot system prompt template used in our experiments. This template was populated with dataset-specific information and task requirements for each visualization request.
  • Figure 3: Pseudocode of the agentic loop.
  • Figure 4: The agentic system prompt template used in our tool-equipped experiments. This template guided the LLM through a structured workflow while providing access to specialized tools for data exploration, code execution, and visualization verification.
  • Figure 5: Typical decision flow of the agentic visualization system.
  • ...and 5 more figures