Quantifying perturbation impacts for large language models

Paulius Rauba, Qiyao Wei, Mihaela van der Schaar

TL;DR

The paper addresses the challenge of quantifying how input perturbations affect stochastic outputs of large language models. It introduces Distribution-Based Perturbation Analysis (DBPA), reframing perturbation effects as a frequentist hypothesis test by constructing empirical output distributions in a low-dimensional semantic space via Monte Carlo sampling. DBPA uses a four-step procedure (sample outputs, build null and alternative distributions with pairwise similarities, compare with a discrepancy metric, and perform permutation-based inference) and reports p-values and an effect-size measure (based on Jensen-Shannon divergence) in a model-agnostic and computationally efficient way. Through case studies on prompt robustness, answer divergence, and alignment with a reference model, the approach demonstrates practical applicability for auditing and reliability assessment of LLMs in high-stakes settings.

Abstract

We consider the problem of quantifying how an input perturbation impacts the outputs of large language models (LLMs), a fundamental task for model reliability and post-hoc interpretability. A key obstacle in this domain is disentangling meaningful changes in model responses from the intrinsic stochasticity of LLM outputs. To overcome this, we introduce Distribution-Based Perturbation Analysis (DBPA), a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. DBPA constructs empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. Comparing Monte Carlo estimates in this reduced-dimensionality space enables tractable frequentist inference without relying on restrictive distributional assumptions. The framework is model-agnostic, supports the evaluation of arbitrary input perturbations on any black-box LLM, yields interpretable p-values, supports multiple perturbation testing via controlled error rates, and provides scalar effect sizes for any chosen similarity or distance metric. We demonstrate the effectiveness of DBPA in evaluating perturbation impacts, showing its versatility for perturbation analysis.
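The four-step procedure summarized above (sample outputs, build null and alternative similarity distributions, compare them with a discrepancy metric, run permutation-based inference) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the LLM responses have already been embedded, and it replaces real embeddings with synthetic Gaussian vectors. The three-set design (`ref`, `base`, `pert`), the histogram-based Jensen-Shannon estimate, and all function names are illustrative assumptions.

```python
import numpy as np


def cosine_similarities(x, y):
    """All pairwise cosine similarities between rows of x and rows of y, flattened."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return np.clip(x @ y.T, -1.0, 1.0).ravel()


def js_divergence(p_samples, q_samples, bins=20):
    """Histogram estimate of the Jensen-Shannon divergence (base 2, so in [0, 1])."""
    edges = np.linspace(-1.0, 1.0, bins + 1)
    p = np.histogram(p_samples, bins=edges)[0].astype(float)
    q = np.histogram(q_samples, bins=edges)[0].astype(float)
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # bins with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def dbpa_test(ref, base, pert, n_perm=200, seed=0):
    """Return (effect size, permutation p-value).

    Assumed design (hypothetical, for illustration only):
      ref, base : embeddings of two independent response samples for the original prompt
      pert      : embeddings of responses for the perturbed prompt
    Null distribution P0 = similarities(ref, base).
    Alternative distribution P1 = similarities(ref, pert).
    """
    rng = np.random.default_rng(seed)

    def omega(b, c):
        return js_divergence(cosine_similarities(ref, b), cosine_similarities(ref, c))

    observed = omega(base, pert)

    # If the perturbation has no effect, base and pert are exchangeable,
    # so shuffling their labels gives draws of the statistic under the null.
    pooled = np.vstack([base, pert])
    n = len(base)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if omega(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)


def synthetic_embeddings(rng, n, direction, dim=16, scale=5.0):
    """Stand-in for embedded LLM responses: Gaussian noise around one axis."""
    mean = np.zeros(dim)
    mean[direction] = scale
    return mean + rng.normal(size=(n, dim))


rng = np.random.default_rng(42)
ref = synthetic_embeddings(rng, 30, direction=0)
base = synthetic_embeddings(rng, 30, direction=0)
pert = synthetic_embeddings(rng, 30, direction=1)  # simulated semantic shift
effect, p_value = dbpa_test(ref, base, pert)
print(f"effect size = {effect:.3f}, p-value = {p_value:.4f}")
```

In practice the synthetic vectors would be replaced by embeddings of sampled model responses, and the number of samples and permutations would be chosen to fit the query budget. The `(1 + exceed) / (1 + n_perm)` form keeps the permutation p-value strictly positive.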

Paper Structure

This paper contains 19 sections, 13 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Example of null and alternative distributions. The null distribution $P_0$ (left, blue) is constructed from the intrinsic variability of responses to the original input. The alternative distribution $P_1$ (right, red), obtained with a perturbed input, is measured against the responses to the original input. Comparing the two quantifies how the output distribution changes under a perturbed prompt in cosine similarity space.
  • Figure 2: Examples of different dimensionality-reduction metrics. Multiple metrics can be used to reduce the dimensionality of the given embeddings.
  • Figure 3: Measuring the effect size $\omega$ and statistical significance of outputs when prefixing the original question with various "Act as..." prompts. Results show that relevant professional roles (e.g., medical professions) yield consistent outputs, while diverse roles produce significantly different responses, demonstrating the framework's ability to quantify prompt perturbation effects. If $p < \alpha$, where $\alpha = 0.05$, we say that the change in the output distribution is statistically significant.
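When several perturbations are tested at once, as with the multiple "Act as..." prefixes in Figure 3, comparing each p-value to $\alpha = 0.05$ on its own inflates the overall error rate. The abstract states that DBPA supports multiple perturbation testing via controlled error rates, but this summary does not say which correction is used. The sketch below therefore shows two standard options, Bonferroni and Benjamini-Hochberg, as assumptions. The p-values are made-up illustration inputs, not results from the paper.

```python
import numpy as np


def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank passing its threshold
        reject[order[: k + 1]] = True     # reject every hypothesis up to that rank
    return reject


# Hypothetical p-values for five perturbations (illustration only).
p_values = [0.001, 0.012, 0.025, 0.038, 0.6]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni is the more conservative of the two. Benjamini-Hochberg typically rejects more hypotheses because it controls the expected proportion of false discoveries rather than the chance of any false discovery.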

Theorems & Definitions (2)

  • Definition 1: Sensitivity
  • Definition 2: Output Distribution