Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics

Srishti Palani; Vidya Setlur

TL;DR

Lexara is a user-centered evaluation toolkit for CVA that turns insights from interviews with developers and end-users into test cases spanning real-world scenarios, interpretable metrics covering visualization quality and language quality, and an interactive toolkit for multi-format, multi-level exploration of results without programming expertise.

Abstract

Large Language Models (LLMs) are transforming Conversational Visual Analytics (CVA) by enabling data analysis through natural language. However, evaluating LLMs for CVA remains a challenge: existing approaches require programming expertise, overlook real-world complexity, and lack interpretable metrics for multi-format (visualization and text) outputs. Through interviews with 22 CVA developers and 16 end-users, we identified use cases, evaluation criteria, and workflows. We present Lexara, a user-centered evaluation toolkit for CVA that operationalizes these insights into: (i) test cases spanning real-world scenarios; (ii) interpretable metrics covering visualization quality (data fidelity, semantic alignment, functional correctness, design clarity) and language quality (factual grounding, analytical reasoning, conversational coherence) using rule-based and LLM-as-a-Judge methods; and (iii) an interactive toolkit enabling experimental setup and multi-format, multi-level exploration of results without programming expertise. We conducted a two-week diary study with six CVA developers drawn from our initial cohort of 22. Their feedback demonstrated Lexara's effectiveness in guiding appropriate model and prompt selection.

Paper Structure (48 sections, 7 figures, 3 tables)

Introduction
Related Work
CVA Tools
CVA Evaluation Methods
Benchmarks
Interactive Benchmarking Tools
Human and Automated Evaluation Methods
Evaluation Metrics for Visualization and Analytical Language
Formative Studies: Eliciting Real-World Use Cases, Evaluation Criteria & Workflows
Study 1: Tool Developers' Use Cases, Evaluation Criteria & Workflows
Study 2: End-Users' Use Cases, Evaluation Criteria & Workflows
Characteristics of CVA Use Cases
Visualization Types
Ambiguity in User Utterances
Evaluation Criteria for CVA Use Cases
...and 33 more sections

Figures (7)

Figure 1: Illustration of how practitioners evaluate the multi-format CVA response. The example shows (left) a user utterance and corresponding model outputs, and (right) the evaluation criteria identified in our formative studies: visualization response quality, assessed by looking at both the rendered visualization and the grammar specification (data fidelity, chart type, functionality, design), and natural language response quality (factual grounding, analytical thinking, and conversation quality). This figure provides a conceptual overview and does not reflect the actual UI of any CVA system.
Figure 2: Lexara's interactive CVA evaluation interface supports two core workflows: (1) an Evaluation Setup Panel where practitioners upload datasources, define test cases, specify prompts, models, expected outputs, and configure CVA-specific metrics; and (2) an Interactive Results Table that streams model outputs—visualizations, structured specs, and natural language—side-by-side. The table enables multi-granular inspection, with expandable metric categories, on-hover explanations, and tools to trace divergences between expected and actual outputs.
Figure 3: For each user request, the system aligns expected and actual outputs across three formats: visualizations, natural language explanations, and JSON specifications. By surfacing detailed differences (e.g., encodings, aggregations, chart types), the interface enables practitioners to pinpoint divergences, understand model behavior, and diagnose strengths or failure modes for various analytic tasks.
Figure 4: The overview panel (top left) highlights recommended model–prompt pairs and aggregated metrics. The label view (top right) breaks down results by chart type, ambiguity, and context-handling. The utterance-level view (bottom) contrasts expected vs. actual responses with detailed metric explanations.
Figure 5: A browser plugin recorded participants' interactions with a popular CVA tool, capturing their utterances, model responses, in-the-loop evaluations via Likert-type scales, and corrected expected outputs.
...and 2 more figures
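
The expected-versus-actual comparison described in the Figure 3 caption can be illustrated with a minimal sketch. This is not Lexara's implementation: the function name `diff_specs` is hypothetical, and the code assumes Vega-Lite-like JSON specifications in which `mark` is a plain string and `encoding` maps each channel to a dictionary with optional `field` and `aggregate` keys.

```python
from typing import Any


def diff_specs(expected: dict[str, Any], actual: dict[str, Any]) -> list[str]:
    """Return human-readable differences between two chart specifications.

    Hypothetical helper: assumes Vega-Lite-like dicts with a "mark" string
    and an "encoding" dict mapping channel -> {"field", "aggregate"}.
    """
    diffs: list[str] = []

    # Chart type: compare the mark of each spec.
    if expected.get("mark") != actual.get("mark"):
        diffs.append(
            f"chart type: expected {expected.get('mark')!r}, "
            f"got {actual.get('mark')!r}"
        )

    exp_enc = expected.get("encoding", {})
    act_enc = actual.get("encoding", {})

    # Encodings and aggregations: visit every channel present in either spec.
    for channel in sorted(set(exp_enc) | set(act_enc)):
        exp_ch = exp_enc.get(channel)
        act_ch = act_enc.get(channel)
        if exp_ch is None:
            diffs.append(f"{channel}: unexpected channel in actual output")
            continue
        if act_ch is None:
            diffs.append(f"{channel}: missing from actual output")
            continue
        for key in ("field", "aggregate"):
            if exp_ch.get(key) != act_ch.get(key):
                diffs.append(
                    f"{channel}.{key}: expected {exp_ch.get(key)!r}, "
                    f"got {act_ch.get(key)!r}"
                )
    return diffs


# Illustrative specs (made-up data, not taken from the paper).
expected = {
    "mark": "bar",
    "encoding": {
        "x": {"field": "region"},
        "y": {"field": "sales", "aggregate": "sum"},
    },
}
actual = {
    "mark": "line",
    "encoding": {
        "x": {"field": "region"},
        "y": {"field": "sales", "aggregate": "mean"},
    },
}

for line in diff_specs(expected, actual):
    print(line)
```

A rule-based check of this kind only flags structural mismatches; judging whether a different chart type is still an acceptable answer would need a separate semantic criterion, such as the LLM-as-a-Judge metrics the abstract mentions.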
