Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs

Jyotika Singh, Weiyi Sun, Amit Agarwal, Viji Krishnamurthy, Yassine Benajiba, Sujith Ravi, Dan Roth

TL;DR

This work formalizes the challenge of evaluating natural language representations (NLRs) of tabular DB outputs produced by Text-to-SQL systems. It introduces Combo-Eval, a hybrid evaluation framework that combines metric-based signals with LLM-based judgments to align more closely with human assessments while reducing LLM calls by 25-61%. The authors also release NLR-BIRD, a dedicated benchmark spanning 11 domains for evaluating NLR generation and judgments, and provide extensive analyses across two evaluation scenarios: ground-truth (GT) and user-question-and-database-result (UQDB). Key findings show that Combo-Eval consistently matches or exceeds the fidelity of LLM-alone judgments, especially with smaller judge models, and that GT references yield higher evaluation accuracy than UQDB, though UQDB remains a viable alternative when GT is unavailable. Overall, the work offers a practical, scalable path for benchmarking NL narrations of tabular data and lays a foundation for extensions to other structured-data narration tasks.

Abstract

In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss and errors in presenting tabular results in NL remain largely unexplored. This paper introduces a novel evaluation method, Combo-Eval, for judging LLM-generated NLRs. It combines the benefits of multiple existing methods, optimizing evaluation fidelity while reducing LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.
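
To make the idea of combining metric-based evaluation with an LLM judge concrete, here is a minimal Python sketch of one way such a hybrid could be wired. It is an illustration under stated assumptions, not the authors' implementation: the `token_f1` metric, the `low` and `high` thresholds, the `combo_eval` function name, and the `llm_judge` callable are all hypothetical and do not come from the paper. The sketch only shows why a hybrid can save LLM calls: clear-cut cases are settled by a cheap metric, and only ambiguous ones are escalated to the judge.

```python
from collections import Counter
from typing import Callable, Tuple


def token_f1(candidate: str, reference: str) -> float:
    """Toy token-overlap F1. A stand-in metric, not the one used in the paper."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def combo_eval(
    candidate: str,
    reference: str,
    llm_judge: Callable[[str, str], bool],
    low: float = 0.2,   # hypothetical threshold: at or below this, reject
    high: float = 0.9,  # hypothetical threshold: at or above this, accept
) -> Tuple[bool, bool]:
    """Return (is_correct, llm_was_called) for one candidate NLR."""
    score = token_f1(candidate, reference)
    if score >= high:
        return True, False   # metric alone is confident: accept
    if score <= low:
        return False, False  # metric alone is confident: reject
    # Ambiguous region: defer to the (more expensive) LLM judge.
    return llm_judge(candidate, reference), True


if __name__ == "__main__":
    def stub_judge(candidate: str, reference: str) -> bool:
        # Placeholder for a real LLM call; always answers "correct".
        return True

    print(combo_eval("total sales is 5 units", "total sales is 5 units", stub_judge))
    print(combo_eval("the total is 5", "total sales is 5 units", stub_judge))
```

In the GT scenario the `reference` would be a ground-truth NLR; in the UQDB scenario it would have to be derived from the user question and database result, which this sketch does not attempt to model.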

Paper Structure

This paper contains 34 sections, 1 equation, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Box plot of percentiles for NLRs in the dataset representing character counts across result-set sizes.
  • Figure 2: The Combo-Eval method flow combining metrics-based evaluation and LLM-as-a-judge.
  • Figure 3: Reasons for incorrect NLRs based on human assessment of model-produced NLRs.
  • Figure 4: Difference between median scores of class 1 and class 0. Scores are computed between model-generated NLRs and GT (blue) and UQDB (orange).
  • Figure 5: Breakdown of incorrect judgments by result size across evaluation methods (Metrics-based, LLM-judge, and Combo-Eval) for GT and UQDB scenarios, showing higher misjudgments by LLM-as-a-judge on higher result sizes.
  • ...and 7 more figures