
Lexara: A User-Centered Toolkit for Evaluating Large Language Models for Conversational Visual Analytics

Srishti Palani, Vidya Setlur

TL;DR

Lexara is a user-centered evaluation toolkit for CVA. It operationalizes insights from interviews with developers and end-users into test cases spanning real-world scenarios, interpretable metrics covering visualization quality and language quality, and an interactive toolkit that enables multi-format and multi-level exploration of results without programming expertise.

Abstract

Large Language Models (LLMs) are transforming Conversational Visual Analytics (CVA) by enabling data analysis through natural language. However, evaluating LLMs for CVA remains a challenge: current approaches require programming expertise, overlook real-world complexity, and lack interpretable metrics for multi-format (visualization and text) outputs. Through interviews with 22 CVA developers and 16 end-users, we identified use cases, evaluation criteria, and workflows. We present Lexara, a user-centered evaluation toolkit for CVA that operationalizes these insights into: (i) test cases spanning real-world scenarios; (ii) interpretable metrics covering visualization quality (data fidelity, semantic alignment, functional correctness, design clarity) and language quality (factual grounding, analytical reasoning, conversational coherence) using rule-based and LLM-as-a-Judge methods; and (iii) an interactive toolkit enabling experimental setup and multi-format, multi-level exploration of results without programming expertise. We conducted a two-week diary study with six CVA developers, drawn from our initial cohort of 22. Their feedback demonstrated Lexara's effectiveness for guiding appropriate model and prompt selection.

Paper Structure (48 sections, 7 figures, 3 tables)

This paper contains 48 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of how practitioners evaluate the multi-format CVA response. The example shows (left) a user utterance and corresponding model outputs, and (right) the evaluation criteria identified in our formative studies: visualization response quality, assessed by looking at both the rendered visualization and the grammar specification (data fidelity, chart type, functionality, design), and natural language response quality (factual grounding, analytical thinking, and conversation quality). This figure provides a conceptual overview and does not reflect the actual UI of any CVA system.
  • Figure 2: Lexara's interactive CVA evaluation interface supports two core workflows: (1) an Evaluation Setup Panel where practitioners upload datasources, define test cases, specify prompts, models, expected outputs, and configure CVA-specific metrics; and (2) an Interactive Results Table that streams model outputs—visualizations, structured specs, and natural language—side-by-side. The table enables multi-granular inspection, with expandable metric categories, on-hover explanations, and tools to trace divergences between expected and actual outputs.
  • Figure 3: For each user request, the system aligns expected and actual outputs across three formats: visualizations, natural language explanations, and JSON specifications. By surfacing detailed differences (e.g., encodings, aggregations, chart types), the interface enables practitioners to pinpoint divergences, understand model behavior, and diagnose strengths or failure modes for various analytic tasks.
  • Figure 4: The overview panel (top left) highlights recommended model–prompt pairs and aggregated metrics. The label view (top right) breaks down results by chart type, ambiguity, and context-handling. The utterance-level view (bottom) contrasts expected vs. actual responses with detailed metric explanations.
  • Figure 5: A browser plugin recorded participants' interactions with a popular CVA tool, capturing their utterances, model responses, in-the-loop evaluations via Likert-type scales, and corrected expected outputs.
  • ...and 2 more figures
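
The Figure 3 caption describes surfacing differences between expected and actual JSON specifications (encodings, aggregations, chart types). The following is a minimal sketch of how such a rule-based comparison might look. It is not Lexara's implementation: the function names `flatten` and `diff_specs` are hypothetical, and the Vega-Lite-style keys (`mark`, `encoding`, `aggregate`) are assumed for illustration, since the source only says the specifications are JSON.

```python
from typing import Any


def flatten(spec: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted paths, e.g. 'encoding.y.aggregate'."""
    flat: dict[str, Any] = {}
    for key, value in spec.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_specs(expected: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple]:
    """Return {path: (expected_value, actual_value)} for every differing path.

    A path missing on one side is reported with None on that side.
    """
    exp, act = flatten(expected), flatten(actual)
    diffs: dict[str, tuple] = {}
    for path in sorted(exp.keys() | act.keys()):
        e, a = exp.get(path), act.get(path)
        if e != a:
            diffs[path] = (e, a)
    return diffs


# Hypothetical expected vs. actual specs (Vega-Lite-style keys are an assumption).
expected = {
    "mark": "bar",
    "encoding": {
        "x": {"field": "region", "type": "nominal"},
        "y": {"field": "sales", "aggregate": "sum"},
    },
}
actual = {
    "mark": "line",
    "encoding": {
        "x": {"field": "region", "type": "nominal"},
        "y": {"field": "sales", "aggregate": "mean"},
    },
}

differences = diff_specs(expected, actual)
for path, (e, a) in differences.items():
    print(f"{path}: expected {e!r}, got {a!r}")
```

A path-level diff like this only flags exact mismatches in chart type, encoding, or aggregation. Judging whether a mismatch is still an acceptable answer would need the kind of semantic or LLM-as-a-Judge metrics the abstract mentions.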