Evaluating the Effect of Retrieval Augmentation on Social Biases

Tianhui Zhang, Yi Zhou, Danushka Bollegala

TL;DR

This paper raises concerns about the use of RAG as a technique for injecting novel facts into NLG systems and calls for careful evaluation of potential social biases in RAG applications before their real-world deployment.

Abstract

Retrieval Augmented Generation (RAG) has gained popularity as a method for conveniently incorporating novel facts that were not seen during the pre-training stage in Large Language Model (LLM)-based Natural Language Generation (NLG) systems. However, LLMs are known to encode significant levels of unfair social biases, and how RAG modulates these biases in NLG systems is not well understood. In this paper, we systematically study the relationship between the different components of a RAG system and the social biases present in the generated text across three languages (i.e. English, Japanese and Chinese) and four social bias types (i.e. gender, race, age and religion). Specifically, using the Bias Question Answering (BBQ) benchmark datasets, we evaluate the social biases in RAG responses from document collections with varying levels of stereotypical biases, employing multiple LLMs as generators. We find that the biases in document collections are often amplified in the generated responses, even when the generating LLM exhibits a low level of bias. Our findings raise concerns about the use of RAG as a technique for injecting novel facts into NLG systems and call for careful evaluation of potential social biases in RAG applications before their real-world deployment.

Paper Structure

This paper contains 48 sections, 6 equations, 10 figures, 18 tables.

Figures (10)

  • Figure 1: A neutral generator LLM would return an unbiased response (UNKNOWN) for the question. However, when the retrieved documents are biased towards male (top) or female (bottom) perspectives, they lead the LLM to generate gender-biased (man/woman) responses.
  • Figure 2: Overview of our RAG social bias evaluation protocol. Given a collection of documents, each encoded individually using an external encoder $f$, a vector index is created over the collection. We use a question, paired with its ambiguous or disambiguated context, selected from the BBQ dataset as the query for retrieval. We retrieve the top $k$ nearest neighbour documents to the query from the vector index, and provide them to the generator LLM, $g$, alongside the question and the context.
  • Figure 3: Diff-Bias under different retrieval sets (averaged over 6 LLMs). Bars show the mean Diff-Bias across GPT-3.5, Llama3-8B-Inst., Qwen-7B-Inst., Qwen-14B base and Inst. and Qwen-72B-Inst. for each bias type. Gray=w/o RAG, red=stereo-set, green=full-set, blue=anti-set. Error bars representing Confidence Intervals (CIs) are omitted for visual clarity (see \ref{tbl:diff-bias:bias-type-full-ci} for full CIs for each model and bias category).
  • Figure 4: Effect of different numbers of retrieved documents on Diff-Bias (averaged over Llama3-8B-Inst., Qwen-7B-Inst., Qwen-14B). The x-axis shows the number of retrieved documents. Gray=w/o RAG, red=stereo-set, teal=full-set, blue=anti-set. The detailed figures for each model are shown in \ref{fig:ambig_nums_retrieved} and \ref{fig:disambig_nums_retrieved}.
  • Figure 5: The evaluation template used in our experiment. Here we take Instruction 1 as an example.
  • ...and 5 more figures
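
The retrieval step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the bag-of-words `encode` function is a toy stand-in for the external encoder $f$, the brute-force similarity search stands in for a real vector index, and the example documents, the prompt layout, and all function names are hypothetical. The generator LLM $g$ is not called; the sketch stops at building the prompt that would be passed to it.

```python
import math
from collections import Counter


def encode(text):
    """Toy stand-in for the encoder f (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(documents):
    """Encode each document individually and keep (text, vector) pairs."""
    return [(doc, encode(doc)) for doc in documents]


def retrieve(index, query, k):
    """Return the k documents most similar to the query (brute force)."""
    q = encode(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(context, question, retrieved):
    """Hypothetical prompt layout: retrieved documents, then context and question."""
    docs = "\n".join(f"- {d}" for d in retrieved)
    return f"Documents:\n{docs}\n\nContext: {context}\nQuestion: {question}\nAnswer:"


# Hypothetical document collection and BBQ-style query (illustration only).
documents = [
    "the engineer fixed the server",
    "the nurse helped the patient",
    "the weather was sunny today",
]
context = "an engineer and a nurse met at the office"
question = "who fixed the server"

index = build_index(documents)
# The question paired with its context serves as the retrieval query.
top_docs = retrieve(index, context + " " + question, k=2)
prompt = build_prompt(context, question, top_docs)
print(prompt)
```

Under this setup, swapping the document collection (for example, a stereotypical versus an anti-stereotypical subset) changes `top_docs` and therefore the prompt, which is the mechanism through which the collection can influence the generated response.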