Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

Ej Zhou, Caiqi Zhang, Tiancheng Hu, Chengzu Li, Nigel Collier, Ivan Vulić, Anna Korhonen

TL;DR

This work tackles multilingual calibration for large language models, revealing that English-centric final-layer signals degrade confidence estimates for non-English languages. Through a layer-wise analysis, the authors show that late-intermediate layers provide more reliable calibration signals for multilingual inputs, whereas English calibration keeps improving up to the final layer. They introduce three training-free calibration methods, Best Layer, Good Layers Ensemble, and Language-Aware Confidence Ensemble (LACE), that exploit intermediate representations and can complement traditional post-hoc techniques. Empirical results on MMMLU and Belebele across six model families demonstrate substantial improvements in cross-lingual calibration and point toward more globally equitable and trustworthy LLMs by looking beyond the final layer.

Abstract

Confidence calibration, the alignment of a model's predicted confidence with its actual accuracy, is crucial for the reliable deployment of Large Language Models (LLMs). However, this critical property remains largely under-explored in multilingual contexts. In this work, we conduct the first large-scale, systematic study of multilingual calibration across six model families and over 100 languages, revealing that non-English languages suffer from systematically worse calibration. To diagnose this, we investigate the model's internal representations and find that the final layer, biased by English-centric training, provides a poor signal for multilingual confidence. In contrast, our layer-wise analysis uncovers a key insight: late-intermediate layers consistently offer a more reliable and better-calibrated signal. Building on this, we introduce a suite of training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each specific language. Our study highlights the hidden costs of English-centric alignment and offers a new path toward building more globally equitable and trustworthy LLMs by looking beyond the final layer.
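
To make the notion of calibration concrete, the following is a minimal sketch of two metrics that appear in the figure captions on this page, Expected Calibration Error (ECE) and the Brier score. It assumes each sample carries a single confidence in the chosen answer and a 0/1 correctness label, and it uses equal-width confidence bins; the function names, the bin count, and the toy numbers are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumption: 10 equal-width bins over [0, 1]
) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += y
        count[idx] += 1
    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue  # empty bins contribute nothing
        gap = abs(hit_sum[k] / count[k] - conf_sum[k] / count[k])
        ece += (count[k] / n) * gap
    return ece


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between confidence and 0/1 correctness (lower is better)."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


# Toy data (hypothetical): two confident hits, two overconfident misses.
toy_conf = [0.9, 0.8, 0.6, 0.3]
toy_correct = [1, 1, 0, 0]
print(expected_calibration_error(toy_conf, toy_correct))
print(brier_score(toy_conf, toy_correct))
```

A model whose confidence matches its accuracy in every bin has an ECE of zero; both metrics are lower-is-better, which is how the correlations and ECE comparisons in the figure captions should be read.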

Paper Structure

This paper contains 47 sections, 6 equations, 15 figures, 13 tables.

Figures (15)

  • Figure 1: Relationship between resource level and Brier score for the LLaMA3 model on the Belebele benchmark. Each point represents a language, and points of the same colour share a writing system. Correlations: Spearman $\rho=-0.59$, $p<10^{-8}$; Kendall $\tau=-0.43$, $p<10^{-8}$; Pearson $r=-0.39$, $p<0.001$. The negative correlations indicate that higher-resourced languages tend to achieve better calibration (a lower Brier score).
  • Figure 2: Confidence distributions for English vs. non-English samples in (a) LLaMA3 and (b) Aya models. The histograms show the density of model confidence scores. The overall distributions differ substantially between English and non-English inputs in LLaMA3, and the gap between confidence (dashed lines) and accuracy (solid lines) is much larger for Aya.
  • Figure 3: ECE vs. entropy across layers on the MMMLU subset for LLaMA3 and Aya. In the multilingual setting, many languages achieve their best ECE in intermediate layers (e.g., 25-32 for LLaMA3 and 26-32 for Aya), after which calibration quality degrades towards the final layer. This contrasts with the English-only setting, where calibration improves monotonically (see the English-only LLaMA3 calibration vs. entropy figure). Notably, the sweet spot in calibration coincides with the sharp drop in entropy.
  • Figure 4: Per-language calibration reliability diagrams for LLaMA3. Each panel shows a reliability histogram with evenly spaced confidence bins. Blue bars correspond to the chosen intermediate layer (Layer 29), and orange bars correspond to the original final layer. The dashed diagonal is the perfectly calibrated line ($y{=}\,x$). Hatched overlays indicate the absolute calibration gap within each bin. The inset reports ECE (%) for both layers and the change $\Delta\text{ECE} = \text{ECE}_{\text{Final}} - \text{ECE}_{29}$ (positive values denote improved calibration at Layer 29).
  • Figure 5: Forest plot of average ECE in MMMLU, with means and 95% CIs.
  • ...and 10 more figures
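
The captions for Figures 3 and 4 compare ECE at an intermediate layer against the final layer and report $\Delta\text{ECE} = \text{ECE}_{\text{Final}} - \text{ECE}_{\text{layer}}$. The sketch below illustrates that comparison and one possible way to build a per-language layer ensemble from it. This is a hypothetical selection rule, not the paper's LACE procedure: it assumes per-layer confidences have already been extracted for one language on a held-out split, that correctness labels are shared across layers, that the final layer is the largest layer index, and that "good" layers are those with strictly lower ECE than the final layer. All names and numbers are illustrative.

```python
from typing import Dict, List, Sequence


def ece(conf: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Expected Calibration Error with equal-width bins (assumed binning)."""
    n = len(conf)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    total = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            total += len(b) / n * abs(acc - mean_conf)
    return total


def select_good_layers(
    layer_conf: Dict[int, List[float]], correct: Sequence[int]
) -> List[int]:
    """Hypothetical rule: keep layers whose ECE beats the final layer's ECE."""
    final = max(layer_conf)  # assumption: largest index is the final layer
    final_ece = ece(layer_conf[final], correct)
    good = sorted(
        layer
        for layer in layer_conf
        if layer != final and ece(layer_conf[layer], correct) < final_ece
    )
    return good or [final]  # fall back to the final layer if none is better


def ensemble_confidence(
    layer_conf: Dict[int, List[float]], layers: Sequence[int]
) -> List[float]:
    """Average the per-sample confidence over the selected layers."""
    n = len(layer_conf[layers[0]])
    return [sum(layer_conf[l][i] for l in layers) / len(layers) for i in range(n)]


# Toy per-layer confidences for one language (hypothetical numbers).
correct = [1, 0, 1, 0]
layer_conf = {
    29: [0.55, 0.55, 0.55, 0.55],
    30: [0.65, 0.65, 0.65, 0.65],
    32: [0.95, 0.95, 0.95, 0.95],  # final layer: overconfident
}
good = select_good_layers(layer_conf, correct)
ens = ensemble_confidence(layer_conf, good)
delta_ece = ece(layer_conf[32], correct) - ece(ens, correct)
print(good, delta_ece)  # positive delta means the ensemble is better calibrated
```

In this toy setup the selection would be run once per language, so different languages can end up with different layer sets; how the paper's methods actually choose and weight layers is not specified on this page.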