The Extractive-Abstractive Spectrum: Uncovering Verifiability Trade-offs in LLM Generations

Theodora Worledge, Tatsunori Hashimoto, Carlos Guestrin

TL;DR

The extractive-abstractive spectrum is introduced, in which search engines and LLMs are extreme endpoints encapsulating multiple unexplored intermediate operating points, and five operating points that span the spectrum are defined.

Abstract

Across all fields of academic study, experts cite their sources when sharing information. While large language models (LLMs) excel at synthesizing information, they do not provide reliable citation to sources, making it difficult to trace and verify the origins of the information they present. In contrast, search engines make sources readily accessible to users and place the burden of synthesizing information on the user. Through a survey, we find that users prefer search engines over LLMs for high-stakes queries, where concerns regarding information provenance outweigh the perceived utility of LLM responses. To examine the interplay between verifiability and utility of information-sharing tools, we introduce the extractive-abstractive spectrum, in which search engines and LLMs are extreme endpoints encapsulating multiple unexplored intermediate operating points. Search engines are extractive because they respond to queries with snippets of sources with links (citations) to the original webpages. LLMs are abstractive because they address queries with answers that synthesize and logically transform relevant information from training and in-context sources without reliable citation. We define five operating points that span the extractive-abstractive spectrum and conduct human evaluations on seven systems across four diverse query distributions that reflect real-world QA settings: web search, language simplification, multi-step reasoning, and medical advice. As outputs become more abstractive, we find that perceived utility improves by as much as 200%, while the proportion of properly cited sentences decreases by as much as 50% and users take up to 3 times as long to verify cited information. Our findings recommend distinct operating points for domain-specific LLM systems and our failure analysis informs approaches to high-utility LLM systems that empower users to verify information.

Paper Structure

This paper contains 77 sections, 21 figures, and 11 tables.

Figures (21)

  • Figure 1: (Top left) Survey results from search engine and LLM users regarding the reasons they prefer search engines versus LLMs; search engines are desired for their ability to provide sources directly, whereas LLMs are appreciated for their ability to synthesize information. The full survey questions are included in the appendix (\ref{subsec:human_preference_survey}). (Top right) Human evaluations show that perceived utility increases as the relative time taken to evaluate whether sentences are properly supported by citation increases across the extractive-abstractive spectrum. (Bottom) Examples of the five operating points from the reference implementations spanning the extractive-abstractive spectrum. More abstractive generations are more concise and better suit the reading level requested in the query than more extractive generations, which more closely reflect the original source context, making them easier to verify.
  • Figure 2: We survey 200 individuals evenly stratified over the Gen Z, Millennial, Gen X, and Boomer generations and report results for those who have used both search engines and LLMs. For answering a query regarding A, B, C, or D, different proportions of users prefer to use search engines or LLMs. The original survey questions are included in the appendix (\ref{subsec:human_preference_survey}). The error bars represent 95% confidence intervals.
  • Figure 3: Human evaluation results averaged over the four query distributions. Fluency and perceived utility increase with abstraction, while citation precision and coverage decrease. Annotators take longer to evaluate coverage as generations become more abstractive. The error bars represent 95% confidence intervals.
  • Figure 4: Human evaluation results by query distribution. As generations become more abstractive, fluency increases similarly for all query distributions, while perceived utility increases at different OPs for different query distributions. Citation precision and coverage decrease while relative T2V increases across every query distribution as generations become more abstractive. The error bars represent 95% confidence intervals.
  • Figure 5: Human evaluation results across the four query distributions for all sentences. The error bars represent 95% confidence intervals.
  • ...and 16 more figures