Noise-Response Calibration: A Causal Intervention Protocol for LLM-Judges

Maxim Khomiakov; Jes Frellsen

Abstract

Large language models (LLMs) are increasingly used as automated judges and synthetic labelers, especially in low-label settings. Yet these systems are stochastic and often overconfident, which makes deployment decisions difficult when external ground truth is limited. We propose a practical calibration protocol based on controlled input interventions: as noise severity increases, task performance should show a statistically significant downward trend. We operationalize this with a slope-based hypothesis test over repeated trials, using signal-to-noise-ratio (SNR) perturbations for tabular data and lexical perturbations for text data. Across UCI tabular benchmarks and four text classification datasets, we find a clear modality gap: text-based judges degrade predictably, whereas most tabular datasets show no statistically significant performance deterioration even under substantial SNR reduction. Interestingly, model performance is lower on the datasets that are insensitive to noise interventions. We present a reproducible methodology and reporting protocol for robust LLM-judge calibration under distribution shift.
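The two ingredients named in the abstract, an SNR-controlled perturbation and a slope-based test over repeated trials, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the SNR is taken as the decibel ratio of mean-square signal to noise variance, the noise is uncorrelated Gaussian, and the hypothesis test is a one-sided permutation test on the ordinary least-squares slope (the abstract only says "slope-based hypothesis test"). The function names and the accuracy values are hypothetical and are not results from the paper.

```python
import math
import random
from statistics import fmean


def add_noise_at_snr(values, snr_db, rng):
    """Add zero-mean Gaussian noise to one feature column at a target SNR.

    Assumption: SNR_dB = 10 * log10(signal_power / noise_power), with
    signal power taken as the mean square of the column.
    """
    signal_power = fmean(v * v for v in values)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in values]


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct severity levels")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_permutation_test(severities, scores, n_perm=2000, seed=0):
    """One-sided permutation test of H0: slope >= 0 against H1: slope < 0.

    Shuffling the scores breaks any link between severity and performance,
    giving a reference distribution for the slope under H0.
    """
    rng = random.Random(seed)
    observed = ols_slope(severities, scores)
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ols_slope(severities, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical accuracies: 4 severity levels, 3 repeated trials each.
severities = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
degrading = [0.90, 0.91, 0.89, 0.80, 0.82, 0.79, 0.70, 0.69, 0.72, 0.60, 0.61, 0.58]
flat = [0.70, 0.71, 0.69, 0.71, 0.70, 0.69, 0.70, 0.71, 0.70, 0.69, 0.70, 0.71]

slope_d, p_d = slope_permutation_test(severities, degrading)
slope_f, p_f = slope_permutation_test(severities, flat)
print(f"degrading: slope={slope_d:.3f}, p={p_d:.4f}")
print(f"flat:      slope={slope_f:.4f}, p={p_f:.4f}")
```

Under this sketch, a dataset would be labeled noise-sensitive when the slope is negative and the p-value falls below a chosen significance level, and insensitive otherwise. A permutation test is used here only because it needs no distributional assumptions and no third-party packages; a parametric test on the regression slope would be an equally plausible reading of the abstract.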

Paper Structure (34 sections, 6 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Core question.
Overview of Contributions
Related works
Hallucination, alignment, and judge reliability.
Perturbation sensitivity in reasoning and NLP.
Causal framing and benchmark context.
Method
Tabular noise via SNR schedules
Text datasets
Noise intervention design.
Lexical noise.
Deterioration analysis
Experimental Setup
Datasets and tasks
...and 19 more sections

Figures (6)

Figure 1: UCI classification accuracy trajectories grouped by noise sensitivity (Sensitive vs Insensitive). Curves are group means across datasets with 95% CI shading, shown separately for uncorrelated and correlated noise.
Figure 2: Per-dataset text accuracy vs lexical noise severity. Curves show means across five repetitions, with pointwise 95% confidence intervals.
Figure 3: Aggregate text accuracy under increasing lexical noise severity.
Figure 4: Clean test-set baseline comparison for classification datasets (sensitive vs insensitive): ECDFs for median accuracy, trial-to-trial standard deviation, and trial range.
Figure 5: 2D datapoint and density-shift illustration under noise perturbations. Left: clean baseline. Middle: uncorrelated Gaussian noise. Right: correlated Gaussian noise.
...and 1 more figure
