What Evidence Do Language Models Find Convincing?

Alexander Wan, Eric Wallace, Dan Klein

TL;DR

This work introduces ConflictingQA, a dataset linking controversial questions to conflicting real-world evidence to study how retrieval-augmented LLMs judge convincingness. By measuring paragraph win-rate and performing counterfactual perturbations, the study shows that model judgments hinge largely on relevance to the query rather than stylistic or credibility cues, revealing a misalignment with human credibility judgments. The findings emphasize the need to improve retrieval quality and to adjust training to better align model judgments with human preferences, including safeguards against misinformation. The work provides a practical benchmark and analysis framework for understanding evidence-level influences on RAG systems in open-ended, real-world questions.

Abstract

Retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". To resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do I find convincing?". In this work, we study how LLMs answer this question. In particular, we construct ConflictingQA, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). We use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect LLM predictions. Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. Taken together, these results highlight the importance of RAG corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how LLMs are trained to better align with human judgements.
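The figure captions for this paper describe relevance as question-paragraph embedding similarity. As a rough illustration of that idea, the sketch below scores paragraphs against a question by cosine similarity. The vectors are made-up toy values rather than output from any real embedding model, and the function names (`cosine_similarity`, `rank_by_relevance`) are hypothetical, not taken from the paper's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def rank_by_relevance(question_vec, paragraph_vecs):
    """Return paragraph ids ordered from most to least similar to the question."""
    scores = {
        pid: cosine_similarity(question_vec, vec)
        for pid, vec in paragraph_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
question = [1.0, 0.0, 1.0]
paragraphs = {
    "on_topic": [0.9, 0.1, 0.8],
    "off_topic": [0.0, 1.0, 0.1],
}
ranking = rank_by_relevance(question, paragraphs)
```

In a real pipeline the vectors would come from a sentence-embedding model; which model the authors used is not stated in this summary.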

Paper Structure (20 sections, 1 equation, 9 figures, 12 tables)

This paper contains 20 sections, 1 equation, 9 figures, 12 tables.

Figures (9)

  • Figure 1: In ConflictingQA, we create contentious questions such as "is aspartame linked to cancer". We also retrieve evidence paragraphs for each question that contain different types of facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (Yes or No). For example, in the figure above we show two evidence paragraphs with their key arguments highlighted. Using ConflictingQA, we study why LLMs trust certain types of evidence paragraphs and argument styles over others.
  • Figure 2: Models over-rely on document relevance. We study how the convincingness of a particular evidence paragraph (measured through win-rate) changes when we modify it. We compare the effects of these changes to a baseline perturbation where we append "Thanks for reading!" to the end of the text (indicated by the dotted line). We find that many stylistic changes, inspired by factors that influence humans, have a neutral or even negative effect on models. On the other hand, perturbations that increase the text's relevance but minimally change its style have a substantial positive effect on models. Descriptions for each perturbation can be found in the paper's appendix.
  • Figure 3: Humans can read a paragraph in isolation and evaluate how convincing it is. LLMs given a paragraph in isolation are unable to express its convincingness in words. Concretely, we plot the win-rate of paragraphs against the score a model outputs when asked to judge convincingness on a 1-5 Likert scale; the x-axis represents the quantiles of this score. The error bars show a 95% CI.
  • Figure 4: Why do models prefer certain paragraphs over others? We test correlations between different features and paragraph win-rates. Here, we show LLaMA-2 Chat 13B (results for all other models are in the paper's appendix), where the model tends to have a slight preference toward samples with low perplexity. In addition, paragraphs with high relevancy scores (high question-paragraph embedding similarity) are significantly more convincing. See Figure 2 for additional analysis. The error bars show the 95% CI (n = 242), and the x-axes represent the quantiles of the target feature.
  • Figure 5: The analogous plots to Figure 4, except for Claude v1 Instant. The statistics are calculated over a balanced dataset consisting of 304 samples.
  • ...and 4 more figures
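
The captions above measure convincingness through a paragraph's win-rate and compare each perturbation against the "Thanks for reading!" baseline. A minimal sketch of that bookkeeping follows. It assumes that a paragraph "wins" a trial when the model's Yes/No answer matches the paragraph's stance after it is shown alongside an opposing paragraph; this summary does not spell out the exact definition, so treat that as an assumption. All names and numbers are hypothetical toy values, not results from the paper.

```python
from collections import defaultdict


def win_rates(trials):
    """Compute per-paragraph win-rates.

    trials: iterable of (paragraph_id, paragraph_stance, model_answer).
    Assumption: a paragraph wins a trial when the model's answer
    equals the paragraph's stance ("Yes" or "No").
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for pid, stance, answer in trials:
        totals[pid] += 1
        if answer == stance:
            wins[pid] += 1
    return {pid: wins[pid] / totals[pid] for pid in totals}


def win_rate_change(original, perturbed):
    """Change in win-rate caused by a perturbation of the paragraph."""
    return perturbed - original


# Toy trial log (made-up outcomes, for illustration only).
trials = [
    ("p1", "Yes", "Yes"),
    ("p1", "Yes", "No"),
    ("p1", "Yes", "Yes"),
    ("p1", "Yes", "Yes"),
    ("p2", "No", "Yes"),
    ("p2", "No", "No"),
]
rates = win_rates(trials)

# Hypothetical win-rates for one paragraph under three conditions.
original_rate = 0.50
baseline_rate = 0.52   # after appending "Thanks for reading!"
relevance_rate = 0.70  # after an edit that increases relevance

baseline_change = win_rate_change(original_rate, baseline_rate)
relevance_change = win_rate_change(original_rate, relevance_rate)
# A perturbation is interesting only if it moves win-rate more than the baseline.
beats_baseline = relevance_change > baseline_change
```

Comparing each change against the baseline change, rather than against zero, separates the effect of a specific edit from the effect of merely altering the text.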