Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering

Miranda Muqing Miao, Young-Min Cho, Lyle Ungar

TL;DR

CORAL introduces a correctness-optimized residual activation lens for inference-time steering in LLMs. It trains a regularized MLP probe on frozen residual activations to predict residual correctness and steers outputs without updating model weights, directly optimizing calibration via the Brier score. Across three 7B-parameter models, CORAL improves accuracy by about 10% and reduces ECE by about 50% in-distribution. The gains transfer to four held-out MCQA benchmarks, with accuracy improving by about 14% and ECE by about 49%, which suggests a general, transferable correctness subspace. The approach is compute-efficient (training on ~8.4k–10k questions) and highlights the need to aggregate distributed information, rather than rely on sparse, localized features, for robust calibration-enhanced MCQA performance.

Abstract

Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.
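
For readers unfamiliar with the two calibration quantities named above, the following is a minimal sketch of how ECE and the Brier score are commonly computed for multiple-choice predictions. It is not taken from the paper: the function names, the 10 equal-width confidence bins, and the multiclass (one-hot) form of the Brier score are all assumptions chosen for illustration, and the paper's exact definitions may differ.

```python
import numpy as np


def brier_score(probs, labels):
    """Multiclass Brier score (assumed form): mean squared error between
    the predicted option probabilities and the one-hot correct answer."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins (assumed binning scheme):
    weighted average of |accuracy - confidence| over non-empty bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)  # confidence = probability of the top option
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; indices fall in 0..n_bins-1.
    bin_idx = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)


# Toy example: two 4-option questions with correct answers 0 and 1.
toy_probs = [[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]]
toy_labels = [0, 1]
toy_brier = brier_score(toy_probs, toy_labels)
toy_ece = expected_calibration_error(toy_probs, toy_labels)
```

Lower is better for both quantities, so an "ECE improvement of 50%" means the gap between confidence and accuracy is roughly halved.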

Paper Structure (70 sections, 28 equations, 5 figures, 2 tables)

This paper contains 70 sections, 28 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of CORAL. Given an MCQA question, a frozen LLM produces per-option hidden states and base probabilities. Hidden states are mean-pooled over answer tokens, z-score normalized, and passed through a trained probe to predict residual correctness. At inference, predicted residuals are centered and used to steer base probabilities toward better calibration.
  • Figure 2: Layer-wise steering performance for DeepSeek-7B-Chat. Left: accuracy across layers 0--30 for lm-eval harness baseline (dotted red lines) and CORAL performance (green bars). Right: corresponding ECE values.
  • Figure 3: Distribution of single-neuron ablation impacts on ECE (left) and accuracy (right) for 300 SAE features selected by activation frequency and correlation with residual correctness. Shaded boxes show interquartile ranges and horizontal lines mark medians. Individual features produce negligible causal effects.
  • Figure 4: Distribution of calibration signal across attention heads. Left: Histogram of $R^2$ scores from 4-layer MLP probes trained on individual attention head activations to predict residual correctness. The mean is $R^2 = 0.022$ and the maximum is $R^2 = 0.085$, so no head reaches $R^2 = 0.10$. Right: Cumulative signal analysis showing that 526 heads (55% of all 960 heads) are required to capture 80% of the total predictive signal, indicating that calibration information is distributed across the attention mechanism rather than localized to specific heads.
  • Figure 5: Signal dimensionality analysis across layers. Left: Cross-validated $R^2$ for predicting residual correctness as a function of the number of PCA components, averaged over 5-layer groups. Later layers (L20--29) achieve higher predictive power, but all layer groups show gradual $R^2$ growth without saturation, indicating that the calibration signal is distributed across many dimensions rather than concentrated in a low-rank subspace. Right: Cumulative explained variance of PCA components. Early layers exhibit highly concentrated activation variance (3 components capture >90%), while later layers have more distributed representations. The mismatch between variance concentration and predictive power suggests calibration information resides in subtle activation patterns rather than dominant principal directions.
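
The inference-time flow described in the Figure 1 caption (mean-pool hidden states over answer tokens, z-score normalize, predict residual correctness with a probe, center the predictions, steer the base probabilities) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the additive update, the steering strength `alpha`, the clip-and-renormalize step, and all function names are hypothetical, and `probe` stands in for any trained regressor that maps a feature vector to a scalar.

```python
import numpy as np


def mean_pool(token_states):
    """Average hidden states over an option's answer tokens.
    token_states: array of shape (num_answer_tokens, hidden_dim)."""
    return np.asarray(token_states, dtype=float).mean(axis=0)


def zscore(features, train_mean, train_std, eps=1e-6):
    """Normalize with statistics assumed to come from the probe's training set."""
    return (features - train_mean) / (train_std + eps)


def steer(base_probs, predicted_residuals, alpha=1.0):
    """Assumed steering rule: add centered residuals to the base
    probabilities, then clip and renormalize to a valid distribution."""
    base = np.asarray(base_probs, dtype=float)
    residuals = np.asarray(predicted_residuals, dtype=float)
    centered = residuals - residuals.mean()  # centering across options
    steered = np.clip(base + alpha * centered, 1e-8, None)
    return steered / steered.sum()


def coral_step(option_token_states, base_probs, probe, train_mean, train_std, alpha=1.0):
    """Run the sketch for one question: one pooled feature vector per option."""
    residuals = [
        probe(zscore(mean_pool(states), train_mean, train_std))
        for states in option_token_states
    ]
    return steer(base_probs, residuals, alpha=alpha)


# Toy usage with a hypothetical linear "probe" on 3-dimensional states.
rng = np.random.default_rng(0)
toy_states = [rng.normal(size=(2, 3)) for _ in range(4)]  # 4 options, 2 tokens each
toy_probe = lambda x: float(x @ np.array([0.1, -0.2, 0.05]))
toy_out = coral_step(toy_states, [0.4, 0.3, 0.2, 0.1], toy_probe,
                     train_mean=np.zeros(3), train_std=np.ones(3), alpha=0.5)
```

Centering the predicted residuals means the steering shifts probability between options rather than raising or lowering all of them together; the frozen model's weights are never touched.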