LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

Yen-Shan Chen, Jing Jin, Peng-Ting Kuo, Chao-Wei Huang, Yun-Nung Chen

TL;DR

This study investigates whether LLMs’ tendency to prefer self-generated content persists in retrieval-augmented generation (RAG) tasks. By simulating the pointwise reranking and generation phases across three QA datasets and five models, it shows that self-preference weakens in RAG contexts while factuality becomes the dominant criterion. Across datasets and architectures, LLMs prefer factually correct passages and are less biased toward their own outputs when evaluating relevance and generating responses. The findings suggest RAG pipelines mitigate self-preference biases and highlight the primacy of factual accuracy for robust QA in real-world retrieval settings.

Abstract

Recent studies have shown that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly a tendency to rate self-generated content more favorably. However, it remains unclear how far this bias extends to fact-oriented tasks, especially retrieval-augmented generation (RAG) frameworks, where keyword extraction and factual accuracy take precedence over stylistic elements. Our study addresses this gap by simulating two critical phases of the RAG framework. In the first phase, LLMs rate human-authored and model-generated passages, emulating the pointwise reranking phase. In the second phase, we conduct pairwise reading comprehension tests to simulate the generation phase. Contrary to previous findings of self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs' output, even in the absence of prior knowledge. These findings are consistent across three common QA datasets (NQ, MARCO, and TriviaQA) and five widely adopted language models (GPT-3.5, GPT-4o-mini, Gemini, LLaMA3, and Mistral). Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based systems, offering insights that may inform the development of more robust and unbiased LLM systems.
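
To make the first phase concrete, the following is a minimal sketch of how a self-preference gap could be computed from pointwise scores. It is not the authors' implementation: the function names (`self_preference_gap`, `toy_score`), the author labels ("self", "human", "other"), and the word-overlap scorer are all assumptions introduced for illustration. In a real experiment the scorer would be an LLM relevance rating, and the paper's actual metric may differ.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical record: (passage_text, author), where author is
# "self" (written by the evaluating model), "human", or "other".
Passage = Tuple[str, str]


def self_preference_gap(
    question: str,
    passages: List[Passage],
    score_fn: Callable[[str, str], float],
) -> float:
    """Mean pointwise score of self-written passages minus the mean of all others.

    A value near zero suggests no self-preference under this (assumed) metric.
    """
    by_author: Dict[str, List[float]] = {}
    for text, author in passages:
        by_author.setdefault(author, []).append(score_fn(question, text))

    self_scores = by_author.get("self", [])
    other_scores = [
        s for author, scores in by_author.items() if author != "self" for s in scores
    ]
    if not self_scores or not other_scores:
        raise ValueError("need at least one self-written and one other passage")
    return mean(self_scores) - mean(other_scores)


def toy_score(question: str, passage: str) -> float:
    """Stand-in for an LLM rating: fraction of question words found in the passage."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words)


if __name__ == "__main__":
    passages = [
        ("paris is the capital of france", "human"),
        ("the capital of france is paris", "self"),
    ]
    print(self_preference_gap("capital of france", passages, toy_score))
```

Because the scorer is passed in as a function, the same gap computation can be reused with any rating source; the toy scorer only exists so the sketch runs without a model.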

Paper Structure

This paper contains 22 sections, 4 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Illustration of the potential inaccuracy in RAG due to LLMs' preference for self-citation.
  • Figure 2: An overview of our proposed experimental framework.
  • Figure 3: Self-preference of models under different factualities (aggregated results).