Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation

Zdeněk Kasner, Ondřej Dušek

TL;DR

This paper introduces Quintd, a tool for collecting novel structured data from public APIs so that data-to-text (D2T) generation can be evaluated without reference outputs. Testing open LLMs (Llama 2, Mistral, Zephyr) on five domains in the Quintd-1 dataset, the authors show that the models produce fluent text in zero-shot settings but make substantial semantic errors: over 80% of outputs contain at least one. Two evaluation methods, human crowdworker judgments and a GPT-4-based automatic metric, quantify semantic accuracy at the word, example, and domain levels, revealing partial agreement between the methods and variance across domains. The work draws practical lessons for preprocessing, long-context handling, and prompting, and offers concrete recommendations (focus on semantic accuracy, long-context models, real-world testing, multilinguality). The data and model outputs are released publicly to support robust, open, and reproducible D2T evaluation and development.

Abstract

We analyze the behaviors of open large language models (LLMs) on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design Quintd - a tool for collecting novel structured data records from public APIs. We find that open LLMs (Llama 2, Mistral, and Zephyr) can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd. However, we show that the semantic accuracy of the outputs is a major issue: both according to human annotators and our reference-free metric based on GPT-4, more than 80% of the outputs of open LLMs contain at least one semantic error. We publicly release the code, data, and model outputs.
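
The headline figure (the share of outputs with at least one semantic error) is an example-level aggregate over word-level annotations. The following is a minimal sketch of that aggregation, not the authors' released code: the record layout, the `errors` field, the function names, and the toy domain labels are all assumptions made for illustration.

```python
from collections import defaultdict


def example_level_error_rate(annotations):
    """Return the fraction of outputs with at least one annotated error span."""
    if not annotations:
        return 0.0
    flagged = sum(1 for record in annotations if record["errors"])
    return flagged / len(annotations)


def per_domain_error_rate(annotations):
    """Group records by their (hypothetical) domain label and score each group."""
    by_domain = defaultdict(list)
    for record in annotations:
        by_domain[record["domain"]].append(record)
    return {
        domain: example_level_error_rate(records)
        for domain, records in by_domain.items()
    }


# Made-up toy data: each record is one generated output, and "errors" holds
# (start_word, end_word) spans marked by an annotator or an automatic metric.
toy_annotations = [
    {"domain": "domain_a", "errors": [(3, 5)]},
    {"domain": "domain_a", "errors": []},
    {"domain": "domain_b", "errors": [(0, 2)]},
    {"domain": "domain_b", "errors": [(1, 1), (7, 9)]},
]

overall = example_level_error_rate(toy_annotations)
by_domain = per_domain_error_rate(toy_annotations)
print(overall, by_domain)
```

An output counts as erroneous as soon as it has one marked span, regardless of how many spans it contains, which is why an example-level rate can be high even when most words are correct.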

Paper Structure (54 sections, 11 figures, 14 tables)

Figures (11)

  • Figure 1: To benchmark LLMs, we download unlabeled structured data from public APIs and prompt LLMs to generate texts based on the data. We annotate semantic errors in the outputs using reference-free metrics.
  • Figure 2: The prompt $\mathcal{P}$ and the model output prefix we used for the experiments in this paper. The placeholder data is filled with the data record $x$, and output_type is filled accordingly for each domain $\mathcal{D}$ (see the tables labeled tab:data and tab:types in the Appendix).
  • Figure 3: Our experimental setup. We first generate the outputs using LLMs that are given raw data and a task-specific prompt. We annotate the word-level semantic errors in the LLM outputs with (a) an automatic metric based on GPT-4 that matches the output to the raw data, and (b) human annotators, who annotate the errors in the output given the data visualization.
  • Figure 4: The prompt we used for the GPT-4 evaluation metric.
  • Figure 5: The instructions given to the human annotators.
  • ...and 6 more figures