
Exploring Precision and Recall to assess the quality and diversity of LLMs

Florian Le Bronnec, Alexandre Verine, Benjamin Negrevergne, Yann Chevaleyre, Alexandre Allauzen

TL;DR

This work introduces a distribution-based evaluation framework for open-ended text generation by adapting Precision and Recall from image generation to NLP, comparing model outputs against a reference distribution without requiring aligned corpora. It defines Precision as the probability that model outputs fall within the reference support, and Recall as the probability that the reference support is covered by the model, both estimated in a latent space via $k$-NN after PCA. The authors demonstrate that separating quality (Precision) and diversity (Recall) yields clearer insights than single-metric baselines, revealing a trade-off influenced by instruction-tuning and model size across tasks like WebText, biographies, and creative writing. They show that instruction-tuned models are more precise but less diverse, larger models tend to be more diverse, and that in-context prompts can boost diversity for chat-style models, albeit with plateau effects. The framework extends the distribution-based NLP evaluation toolkit, enabling nuanced assessments of open-ended generation and offering practical guidance for model development, with code and data released for reproducibility and broader adoption.

Abstract

We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral, focusing on importing Precision and Recall metrics from image generation to text generation. This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora. By conducting a comprehensive evaluation of state-of-the-art language models, the study reveals new insights into their performance on open-ended generation tasks, which are not adequately captured by traditional benchmarks. The findings highlight a trade-off between the quality and diversity of generated samples, particularly when models are fine-tuned on instruction datasets or with human feedback. This work extends the toolkit for distribution-based NLP evaluation, offering insights into the practical capabilities and challenges that current LLMs face in generating diverse and high-quality text. We release our code and data.

Paper Structure (74 sections, 1 theorem, 5 equations, 16 figures, 9 tables)


Key Result

theorem 1

The set of Pareto-optimal trade-offs, that is, the PR-Curve, can be represented as the set of points $(\alpha_\lambda, \beta_\lambda)_{\{\lambda\in [0, \infty]\}}\in[0,1]^2$ such that:

Figures (16)

  • Figure 1: Precision and Recall of various models on generating the WebText dataset, with 2-standard-deviation error ellipses. The different behaviors of chat and pre-trained models are clearly evidenced.
  • Figure 2: Illustration of a simple case where using the MAUVE score alone fails to provide a fine-grained evaluation of quality and diversity. We consider a reference dataset $P$ composed of articles from 2 labels, World and Sport. $Q_2$ is made of articles from the same distribution. We compare it with two other datasets: $Q_1$, composed only of World articles, and $Q_3$, composed of equal numbers of World, Sport, Business and Sci/Tech articles. Relative to $Q_2$, the MAUVE scores of $Q_1$ and $Q_3$ are almost identical, while Precision and Recall help differentiate how the distributions actually differ from the reference $P$.
  • Figure 3: Example of distribution of images. $P$ is the reference distribution of images of the CelebA dataset liu_deep_2015, $Q_1$ and $Q_2$ are two different distributions of images. $Q_1$ has high quality, but low diversity, while $Q_2$ has high diversity and low quality. Numbers and images are from kynkaanniemi_improved_2019.
  • Figure 4: Precision and Recall for distribution-based metrics. (a) Distributions $P$ and $Q$. (b) Precision is the proportion of the support of $Q$ that generates $P$. (c) Recall is the proportion of the support of $P$ generated by $Q$.
  • Figure 5: Our pipeline to compute the Precision and Recall metrics. Texts are projected into a latent space of a pre-trained model, where a $k$-NN estimation is performed to estimate the relative overlaps of $P$ and $Q$.
  • ...and 11 more figures
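
The pipeline summarized in Figure 5 (embed texts, reduce dimensionality, then estimate the overlap of $P$ and $Q$ with $k$-NN) can be sketched roughly as below. This is a minimal NumPy illustration of $k$-NN support estimation, not the authors' released code. The function names, the default `k=4`, fitting PCA on the pooled embeddings, and the assumption that text embeddings have already been computed are all hypothetical choices made for the example.

```python
import numpy as np

def pca_project(ref, gen, n_components):
    """Project both sets onto the top principal components.
    Assumption: PCA is fitted on the pooled embeddings."""
    pooled = np.vstack([ref, gen])
    mean = pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    basis = vt[:n_components].T
    return (ref - mean) @ basis, (gen - mean) @ basis

def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negatives from rounding

def knn_radii(x, k):
    """Distance from each point to its k-th nearest neighbour in x,
    excluding the point itself."""
    if k >= len(x):
        raise ValueError("k must be smaller than the number of points")
    d = pairwise_dist(x, x)
    np.fill_diagonal(d, np.inf)
    return np.sort(d, axis=1)[:, k - 1]

def coverage(samples, support_points, k):
    """Fraction of `samples` that fall inside at least one k-NN ball
    centred on a point of `support_points`."""
    radii = knn_radii(support_points, k)
    d = pairwise_dist(samples, support_points)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(ref, gen, k=4, n_components=None):
    """Precision: share of generated samples inside the estimated
    reference support. Recall: share of reference samples inside the
    estimated support of the generated set."""
    if n_components is not None:
        ref, gen = pca_project(ref, gen, n_components)
    return coverage(gen, ref, k), coverage(ref, gen, k)
```

With `ref` and `gen` given as arrays of shape `(n_samples, dim)`, `precision_recall(ref, gen, k=4, n_components=2)` returns a `(precision, recall)` pair. A generated set that reproduces only one of two reference modes would be expected to score high Precision and low Recall, which is the kind of asymmetry a single scalar metric can hide.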

Theorems & Definitions (3)

  • definition 1
  • definition 2: Precision-Recall trade-off sajjadi_assessing_2018
  • theorem 1: PR-Curve simon_revisiting_2019