Retrieving Versus Understanding Extractive Evidence in Few-Shot Learning

Karl Elbakian, Samuel Carton

TL;DR

This work investigates how large language models retrieve and interpret within-document evidence to support few-shot predictions, using GPT-4 and Gemini across five datasets with gold-standard extractive annotations. It shows a strong association between label errors and evidence retrieval errors, yet finds that retrieval failures often stem from misinterpretation or confounding evidence rather than missing evidence, a favorable sign for verification-based applications. Through ablations on prompting and evidence availability, the study demonstrates that extracting correct evidence is feasible in many cases, and that missing evidence is typically due to competing within-document signals rather than fundamental interpretability limits. The findings suggest that extractive self-rationalization can power downstream verification tools, provided the retrieval component is robust and supported by human-in-the-loop inspection when needed.

Abstract

A key aspect of alignment is the proper use of within-document evidence to construct document-level decisions. We analyze the relationship between the retrieval and interpretation of within-document evidence for large language models in a few-shot setting. Specifically, we measure the extent to which model prediction errors are associated with evidence retrieval errors with respect to gold-standard human-annotated extractive evidence for five datasets, using two popular proprietary models. We perform two ablation studies to investigate when both label prediction and evidence retrieval errors can be attributed to qualities of the relevant evidence. We find that there is a strong empirical relationship between model prediction and evidence retrieval error, but that evidence retrieval error is mostly not associated with evidence interpretation error, a hopeful sign for downstream applications built on this mechanism.

Paper Structure

This paper contains 21 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Artificial examples of abstractive and extractive explanations for an erroneous moderation prediction. Only the extractive explanation provides a basis for refuting it.
  • Figure 2: Two examples of GPT-4 prompts on the same SciFact item. Model output highlighted in green. Human-annotated evidence underlined, claim bolded. The model misses the evidence and mislabels the document in the predict-then-explain setting, but correctly labels it when the evidence is provided, even without its surrounding context.
  • Figure 3: Box-and-whisker plots of label prediction error versus mean predicted evidence rationale F1 for all five datasets in the explain-then-predict condition.
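
The rationale F1 named in Figure 3 compares model-extracted evidence against the human-annotated evidence. This excerpt does not say exactly how the paper computes it, so the following is only a minimal sketch that assumes a token-level definition: evidence is represented as a set of token positions, and F1 is the harmonic mean of precision and recall over those positions. The function name `rationale_f1` and the choice to score two empty rationales as 1.0 are illustrative assumptions, not the authors' implementation.

```python
def rationale_f1(predicted_tokens, gold_tokens):
    """Token-level F1 between predicted and gold evidence.

    Both arguments are iterables of token positions (ints) marking which
    tokens of the document were selected as evidence. This token-level
    formulation is an assumption made for illustration.
    """
    predicted = set(predicted_tokens)
    gold = set(gold_tokens)

    # Assumed convention: two empty rationales agree perfectly.
    if not predicted and not gold:
        return 1.0

    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0

    precision = overlap / len(predicted)  # share of predicted tokens that are gold
    recall = overlap / len(gold)          # share of gold tokens that were retrieved
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # The model selected tokens 1-3; annotators selected tokens 2-4.
    print(rationale_f1([1, 2, 3], [2, 3, 4]))
```

Under this definition, a low score alongside a wrong label would count as an evidence retrieval error coinciding with a prediction error, which is the association the abstract describes.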