Fairness Testing in Retrieval-Augmented Generation: How Small Perturbations Reveal Bias in Small Language Models
Matheus Vinicius da Silva de Oliveira, Jonathan de Andrade Silva, Awdren de Lima Fontao
TL;DR
This work addresses fairness testing in Retrieval-Augmented Generation (RAG) by applying metamorphic testing to small language models (SLMs) and their retrievers. It introduces the Retriever Robustness Score (RRS) to diagnose bias at the retrieval stage, and it systematically quantifies how demographic perturbations shift toxicity and sentiment outputs across three SLMs hosted on HuggingFace, each running in a RAG pipeline. Empirical results show substantial fairness violations: a 28.52% attack success rate (ASR) at the retriever and up to 33.00% ASR end to end, with a clear bias hierarchy in which racial perturbations are the strongest drivers. The study provides actionable metrics and deployment guidance for small and medium-sized enterprises (SMEs), highlighting the need for component-level fairness testing in RAG systems and for curating retrieval content to prevent bias amplification. Overall, it shifts fairness auditing from model-centric evaluation to analysis of the end-to-end system and its intermediate components, supporting more reliable and equitable AI-enabled software in production.
Abstract
Large Language Models (LLMs) are widely used across multiple domains but continue to raise concerns regarding security and fairness. Beyond known attack vectors such as data poisoning and prompt injection, LLMs are also vulnerable to fairness bugs: unintended behaviors triggered by sensitive demographic cues (e.g., race or sexual orientation) that should not affect outcomes. Another key issue is hallucination, where models generate plausible yet false information. Retrieval-Augmented Generation (RAG) has emerged as a strategy to mitigate hallucinations by combining external retrieval with text generation. However, its adoption raises new fairness concerns, because the retrieved content itself may surface or amplify bias. This study conducts fairness testing through metamorphic testing (MT), introducing controlled demographic perturbations into prompts to assess fairness in sentiment analysis performed by three Small Language Models (SLMs) hosted on HuggingFace (Llama-3.2-3B-Instruct, Mistral-7B-Instruct-v0.3, and Llama-3.1-Nemotron-8B), each integrated into a RAG pipeline. Results show that minor demographic variations can violate up to one third of the metamorphic relations (MRs). A detailed analysis of these failures reveals a consistent bias hierarchy, with perturbations involving racial cues being the predominant cause of violations. In addition to offering a comparative evaluation, this work reinforces that the retrieval component in RAG must be carefully curated to prevent bias amplification. The findings serve as a practical alert for developers, testers, and small organizations that aim to adopt accessible SLMs without compromising fairness or reliability.
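To make the metamorphic-testing idea concrete, the following is a minimal sketch, not the study's implementation. It assumes a metamorphic relation of the form "adding a demographic cue to a prompt must not change the predicted sentiment label" and reports the fraction of violated relations. The names `make_cases`, `violation_rate`, and `toy_classifier`, the placeholder cues `GroupA`/`GroupB`/`GroupC`, and the example sentences are all hypothetical; in practice `classify` would wrap an SLM together with its retriever.

```python
from typing import Callable, List, Tuple

# A test case pairs a source prompt with its demographically perturbed follow-up.
Case = Tuple[str, str]


def make_cases(sentences: List[str], cues: List[str], anchor: str = "person") -> List[Case]:
    """Build (source, follow-up) pairs by inserting each cue before the anchor word.

    Assumption: a simple string substitution stands in for the study's
    perturbation procedure, which is not specified in the abstract.
    """
    cases: List[Case] = []
    for sentence in sentences:
        if anchor not in sentence:
            continue  # nothing to perturb in this sentence
        for cue in cues:
            follow_up = sentence.replace(anchor, f"{cue} {anchor}", 1)
            cases.append((sentence, follow_up))
    return cases


def violation_rate(classify: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of pairs whose predicted labels differ (a violated relation)."""
    if not cases:
        return 0.0
    violations = sum(1 for source, follow_up in cases
                     if classify(source) != classify(follow_up))
    return violations / len(cases)


def toy_classifier(text: str) -> str:
    """Deliberately biased stand-in for an SLM-based sentiment classifier."""
    if "terrible" in text or "GroupB" in text:
        return "negative"
    return "positive"


# Hypothetical inputs, used only to exercise the functions above.
SENTENCES = [
    "The person said the food was great.",
    "The person said the service was terrible.",
]
CUES = ["GroupA", "GroupB", "GroupC"]

cases = make_cases(SENTENCES, CUES)
rate = violation_rate(toy_classifier, cases)
print(f"{len(cases)} relations checked, violation rate = {rate:.2%}")
```

In this toy setup only the `GroupB` variant of the first sentence flips the label, so one of six relations is violated. Grouping the violations by cue category would give the kind of per-attribute breakdown that the abstract describes as a bias hierarchy.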
