The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation

Mingyi Liu

Abstract

RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions, depending on the clustering method, produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while token entropy, which comes at no extra cost, retains signal (AUROC=0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves an AUROC of 0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows a 1.0% single-cluster rate (SCR) vs. 28.5% for the instruct model (p < 10^{-6}). A training stage ablation (Base 0.0% -> SFT 1.5% -> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals that alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross-embedder validation with two independent embedding families rules out coupling bias. Cross-dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding, response homogenization, is implementation-independent and label-free. Motivated by this diagnosis, we explore a cheapest-first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| <= 0.12) enable 57% cost savings.
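
The link between a single semantic cluster and chance-level AUROC can be illustrated with a small sketch. This is not the paper's implementation: the greedy word-level Jaccard clustering, the 0.5 similarity threshold, and all function names are assumptions chosen for illustration. When every sample lands in one cluster, the entropy over clusters is zero for every question, so correct and incorrect answers receive identical scores and the AUROC collapses to 0.5.

```python
import math

def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cluster(samples, threshold=0.5):
    """Greedy clustering (assumed scheme): a sample joins the first cluster
    whose first member is at least `threshold` similar, else starts a new one."""
    clusters = []
    for s in samples:
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

def cluster_entropy(clusters):
    """Shannon entropy (nats) of the empirical distribution over clusters."""
    n = sum(len(c) for c in clusters)
    return sum(-(len(c) / n) * math.log(len(c) / n) for c in clusters)

def single_cluster_rate(questions):
    """Fraction of questions whose samples all fall into one cluster."""
    return sum(len(cluster(s)) == 1 for s in questions) / len(questions)

def auroc(scores, is_error):
    """Probability that an error scores higher than a correct answer (ties = 0.5)."""
    pos = [s for s, e in zip(scores, is_error) if e]
    neg = [s for s, e in zip(scores, is_error) if not e]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data: two homogenized questions, one answered correctly and one not.
homogenized = [
    ["Paris is the capital"] * 3,
    ["The answer is 4"] * 3,
]
scores = [cluster_entropy(cluster(q)) for q in homogenized]
is_error = [False, True]

# A question with genuinely diverse samples, for contrast.
diverse = ["blue whale", "african elephant", "blue whale"]
```

On the toy data, both homogenized questions score zero, so no threshold on the cluster entropy can separate the wrong answer from the right one, while the diverse question yields a positive entropy.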

Paper Structure (38 sections, 3 figures, 21 tables, 1 algorithm)

Figures (3)

  • Figure 1: UCBD framework: four-column architecture mapping brain mechanisms, boundary detectors, detection signals, and response strategies. Solid borders indicate experimentally validated components (B1--B2, cascade B1$\to$B2); dashed borders indicate theoretical components awaiting empirical validation (B3--B5). The Pointer Model (center, PFC analogue) connects to all five boundary detectors (solid arrows for validated boundaries B1--B2, dashed arrows for theoretical ones B3--B5) and dispatches queries into the cheapest-first cascade. Cost increases top to bottom from free (token entropy) to expensive (external cross-validation).
  • Figure 2: The alignment tax mechanism (Jaccard clustering, primary metric). On single-cluster questions (40.0%), semantic entropy (SE) drops to exact chance (AUROC 0.500, dashed red) because all 10 samples produce the same answer. B1 (token entropy) retains discriminative power (AUROC 0.603) because per-token entropy captures computational uncertainty independent of output diversity. Under embedding clustering (sensitivity analysis), 79% of questions are single-cluster.
  • Figure 3: AUROC for error detection across tasks (dashed red = chance). The alignment tax is visible on TruthfulQA: B1 (free, 0.599) matches SelfCheckGPT (6$\times$ cost, 0.588, $p$=0.65) and significantly outperforms Jaccard-approximated SE (11$\times$ cost, 0.548, $p_{\text{adj}}$=0.04). On GSM8K, where alignment does not suppress entropy, B1 reaches 0.724 ($d$=0.81).
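
The token-entropy signal, selective prediction at a fixed coverage, and the cheapest-first dispatch described in the captions above can be sketched as follows. This is a hedged illustration only: the function names, the `low`/`high` stopping thresholds, the detector costs, and the toy numbers are all hypothetical, and the stopping rule is an assumed design rather than the UCBD procedure from the paper.

```python
import math

def mean_token_entropy(token_dists):
    """Average Shannon entropy (nats) over per-token probability distributions."""
    ents = [sum(-p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(ents) / len(ents)

def selective_accuracy(uncertainty, correct, coverage):
    """Accuracy on the `coverage` fraction of items with the lowest uncertainty."""
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    keep = order[:max(1, int(coverage * len(order)))]
    return sum(correct[i] for i in keep) / len(keep)

def cascade(query, detectors, low, high):
    """Run detectors cheapest first and stop at the first confident verdict.

    detectors: list of (name, cost, score_fn), sorted by increasing cost.
    Assumed rule: score <= low means answer, score >= high means abstain,
    anything in between escalates to the next, more expensive detector.
    """
    spent = 0.0
    name = None
    for name, cost, score_fn in detectors:
        spent += cost
        score = score_fn(query)
        if score <= low:
            return "answer", name, spent
        if score >= high:
            return "abstain", name, spent
    return "abstain", name, spent  # still ambiguous after every detector

# Toy selective prediction: the single error carries the highest uncertainty.
uncertainty = [0.1, 0.9, 0.2, 0.8]
correct = [1, 0, 1, 1]
full = selective_accuracy(uncertainty, correct, 1.0)
half = selective_accuracy(uncertainty, correct, 0.5)

# Toy cascade: a free entropy check followed by a costly sampling check.
detectors = [
    ("token_entropy", 0.0, lambda q: q["entropy"]),
    ("sampling", 10.0, lambda q: q["disagreement"]),
]
easy = cascade({"entropy": 0.05, "disagreement": 0.0}, detectors, 0.2, 1.0)
hard = cascade({"entropy": 0.5, "disagreement": 1.2}, detectors, 0.2, 1.0)
```

In this toy setup the easy query is resolved by the free detector alone, and only the ambiguous query pays for the expensive one, which is the intuition behind ordering detectors by cost.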