TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness

Yongxin Zhou, Philippe Mulhem, Didier Schwab

TL;DR

The paper studies how perturbations in retrieved content interact with LLM temperature in RAG systems, introducing a dedicated RAG Perturbation-Temperature Analysis Framework and a diagnostic benchmark tested on HotpotQA. It finds that higher temperatures consistently amplify vulnerability to perturbations and that perturbation effects can exhibit non-linear sensitivity across the temperature range. The work provides a taxonomy of perturbations, a 440-condition evaluation setup across diverse models, and practical guidelines for model selection and temperature tuning under noisy retrieval conditions. It also suggests future work to incorporate actual retrieval noise in end-to-end RAG pipelines.

Abstract

The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
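As a rough illustration of how a joint perturbation-temperature evaluation can be laid out, the sketch below enumerates one condition per (model, context variant, temperature) combination. The perturbation names follow the framework figure; the temperature grid, the function name `build_conditions`, and the dictionary layout are assumptions made for this example and are not taken from the paper.

```python
from itertools import product

# Context variants: the unperturbed baseline plus the three perturbation
# types named in the framework figure.
PERTURBATIONS = ["original", "replacement", "removal", "ner_substitution"]

# Hypothetical temperature grid (assumption): the paper's actual grid is
# not stated in this summary.
TEMPERATURES = [0.0, 0.6, 1.0, 2.0]


def build_conditions(models, perturbations=PERTURBATIONS, temperatures=TEMPERATURES):
    """Return one dict per (model, perturbation, temperature) combination."""
    return [
        {"model": m, "perturbation": p, "temperature": t}
        for m, p, t in product(models, perturbations, temperatures)
    ]


conditions = build_conditions(["gpt-4o", "deepseek-reasoner"])
print(len(conditions))  # 2 models x 4 context variants x 4 temperatures = 32
```

Enumerating the full cross product, rather than varying one factor at a time, is what makes interaction effects between perturbation type and temperature observable.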

Paper Structure

This paper contains 18 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: RAG Perturbation-Temperature Analysis Framework. The methodology stresses system robustness along two axes: external context perturbations (replacement, removal, NER substitution) and internal LLM temperature variation. The evaluation measures correctness and output variability across these conditions to establish a benchmark and derive practical guidelines.
  • Figure 2: BERTScore trends across temperature variations for different models, comparing response types under perturbation. Solid lines represent mean scores across samples, while shaded areas denote $\pm$ standard deviation. The top row presents results for comparison questions; the bottom row presents results for bridge questions.
  • Figure 3: Coefficient of Variation (CV) for BERTScore across models, temperatures, and perturbation types. Each subplot displays CV trends for a model. The baseline CV value (average CV for the original, unperturbed context across all temperatures) is indicated in the top left of each subplot. The top row presents results for comparison questions; the bottom row presents results for bridge questions.
  • Figure 4: BERTScore distribution for gpt-4o on bridge questions across perturbation types at two temperatures (Left: $T=0.6$, Right: $T=2.0$). Each subplot shows a boxplot representing median, interquartile range, and whiskers, with individual sample scores (black dots) and outliers (white dots).
  • Figure 5: BERTScore distribution for deepseek-reasoner on bridge questions across perturbation types at two temperatures (Left: $T=0.6$, Right: $T=2.0$). Each subplot shows a boxplot representing the same elements as in Figure 4.
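
The coefficient of variation reported in Figure 3 can be sketched as the standard deviation of a set of scores divided by their mean, with the baseline taken as the average CV of the unperturbed context across temperatures, as the caption describes. This is a minimal sketch under assumptions: the function names are hypothetical, and the use of the population standard deviation is a choice made here, since the summary does not say which estimator the authors use.

```python
import statistics


def coefficient_of_variation(scores):
    """CV = standard deviation / mean for a list of scores.

    Uses the population standard deviation (assumption; the paper may
    use the sample estimator instead).
    """
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean score is zero")
    return statistics.pstdev(scores) / mean


def baseline_cv(scores_by_temperature):
    """Average the CV of unperturbed-context scores over all temperatures.

    `scores_by_temperature` maps a temperature to the list of scores
    obtained at that temperature (hypothetical data layout).
    """
    return statistics.fmean(
        coefficient_of_variation(scores)
        for scores in scores_by_temperature.values()
    )


# Illustrative numbers only, not results from the paper.
example = {0.6: [0.9, 0.9, 0.9], 2.0: [0.8, 0.9, 1.0]}
cv_high_temperature = coefficient_of_variation(example[2.0])
baseline = baseline_cv(example)
```

Because CV normalizes spread by the mean, it allows output variability to be compared across models whose average scores differ.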