Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges

Maxim Khomiakov, Jes Frellsen

Abstract

Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: as noise severity increases, task performance should exhibit a statistically significant deterioration trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find a clear modality gap: while text-based judges degrade predictably, the majority of tabular datasets show no statistically significant performance deterioration even under substantial signal-to-noise reduction. Interestingly, we find that model performance is lower on datasets that are insensitive to noise interventions. We present a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
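The slope-based test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact test statistic or null distribution, so the version below is an assumption: it fits an ordinary least-squares slope of accuracy against noise severity and uses a one-sided permutation test to ask whether the slope is significantly negative. The function names (`ols_slope`, `slope_permutation_test`), the severity grid, and the synthetic accuracy values are all hypothetical and serve only to show the mechanics.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_permutation_test(severity, accuracy, n_perm=5000, seed=0):
    """One-sided permutation test for a negative accuracy-vs-severity slope.

    Assumed null hypothesis: accuracy is exchangeable across severity levels
    (no trend). The p-value is the share of shuffled datasets whose slope is
    at least as negative as the observed one.
    """
    rng = random.Random(seed)
    observed = ols_slope(severity, accuracy)
    shuffled = list(accuracy)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ols_slope(severity, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical setup: 5 severity levels, 5 repeated trials per level.
severity = [level for level in (0.0, 0.25, 0.5, 0.75, 1.0) for _ in range(5)]

# Synthetic accuracies (illustrative only, not results from the paper).
data_rng = random.Random(1)
degrading = [0.90 - 0.30 * s + data_rng.gauss(0, 0.02) for s in severity]
flat = [0.60 + data_rng.gauss(0, 0.02) for _ in severity]

slope_deg, p_deg = slope_permutation_test(severity, degrading)
slope_flat, p_flat = slope_permutation_test(severity, flat)

print(f"degrading judge: slope={slope_deg:.3f}, p={p_deg:.4f}")
print(f"flat judge:      slope={slope_flat:.3f}, p={p_flat:.4f}")
```

Under this sketch, a dataset would be labeled noise-sensitive when the p-value falls below a chosen significance level and noise-insensitive otherwise. A parametric t-test on the regression slope would be an equally plausible reading of "slope-based hypothesis test"; the permutation variant is used here only because it needs no distributional assumptions or external libraries.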

Paper Structure (34 sections, 6 equations, 6 figures, 5 tables, 1 algorithm)

This paper contains 34 sections, 6 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: UCI classification accuracy trajectories grouped by noise sensitivity (Sensitive vs Insensitive). Curves are group means across datasets with 95% CI shading, shown separately for uncorrelated and correlated noise.
  • Figure 2: Per-dataset text accuracy vs lexical noise severity. Curves show means across five repetitions, with pointwise 95% confidence intervals.
  • Figure 3: Aggregate text accuracy under increasing lexical noise severity.
  • Figure 4: Clean test-set baseline comparison for classification datasets (sensitive vs insensitive): ECDFs for median accuracy, trial-to-trial standard deviation, and trial range.
  • Figure 5: 2D datapoint and density-shift illustration under noise perturbations. Left: clean baseline. Middle: uncorrelated Gaussian noise. Right: correlated Gaussian noise.
  • ...and 1 more figures