Calibrated Language Models and How to Find Them with Label Smoothing

Jerry Huang, Peng Lu, Qiuhao Zeng

TL;DR

This work investigates how supervised fine-tuning (SFT) degrades calibration in open-source large language models and proposes label smoothing (LS) as a practical remedy. It provides a theoretical and empirical analysis of LS, showing that while LS generally improves calibration, its effectiveness diminishes for large-vocabulary models with relatively small hidden sizes due to entropy constraints; it also reveals how temperature scaling and logit capping can recover LS benefits in those cases. To address practical barriers, the authors design memory-efficient GPU kernels for smoothed cross-entropy, enabling LS for LV-LLMs with minimal memory overhead and near-competitive speed. Empirically, LS with a modest smoothing factor ($\beta\approx0.1$) yields improved calibration (lower ECE and RMS-CE) across MMLU, HellaSwag, and ARC-Easy on models like LLaMA3-8B and Mistral-7B, while highlighting the need to account for model size and vocabulary in calibration strategies. Overall, the paper contributes both theoretical insights into LS for calibration in LLMs and a practical kernel-based solution to scale LS to large vocabularies, enhancing reliability without sacrificing accuracy.

Abstract

Recent advances in natural language processing (NLP) have opened up greater opportunities to enable fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, how this impacts confidence calibration for reliable model output has not been fully studied. In this work, we examine various open-source LLMs, identifying significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown to be an effective regularizer against overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight as to why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large vocabulary LLMs (LV-LLMs). We posit that the cause stems from the ability to become over-confident, which has a direct relationship with the hidden size and vocabulary size, and justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label-smoothed loss setting, designing a customized kernel to dramatically reduce memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
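
To make the label-smoothing idea concrete, the following is a minimal sketch of a smoothed cross-entropy for a single token, written in plain Python. It assumes the common convention in which the target distribution is $(1-\beta)$ times the one-hot label plus $\beta/K$ spread uniformly over all $K$ vocabulary entries; the paper's exact formulation may differ, and the function names `log_softmax` and `smoothed_cross_entropy` are illustrative, not taken from the authors' code. It is a reference computation only and does not reflect the memory-efficient kernel described above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def smoothed_cross_entropy(logits, target, beta=0.1):
    """Cross-entropy against (1 - beta) * one_hot(target) + beta / K * uniform.

    Assumed convention (not confirmed by the source): the smoothing mass
    beta is spread uniformly over all K classes, including the target.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    logp = log_softmax(logits)
    k = len(logits)
    nll = -logp[target]          # standard (hard-label) cross-entropy term
    uniform = -sum(logp) / k     # cross-entropy against the uniform distribution
    return (1.0 - beta) * nll + beta * uniform


# A very confident prediction is penalized more once smoothing is enabled.
confident = [10.0, 0.0, 0.0, 0.0]
plain = smoothed_cross_entropy(confident, target=0, beta=0.0)
smooth = smoothed_cross_entropy(confident, target=0, beta=0.1)
```

With `beta=0.0` the function reduces to the ordinary negative log-likelihood. The uniform term grows as the predicted distribution becomes more peaked, which is the mechanism by which smoothing discourages overconfident outputs.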

Paper Structure

This paper contains 35 sections, 5 theorems, 42 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Lemma 3.1

Let $f(\cdot;\bm{\theta}):\mathcal{X}\rightarrow[0,1]^K$ be a real-valued function of the form $f({\bm{x}};\bm{\theta})=\sum_{i=1}^{d}f_{i}({\bm{x}}[i];\bm{\theta})$ where $f_{i}(\cdot;\bm{\theta})$ is an arbitrary one-dimensional function, and $f$ is in a hypothesis class $\mathcal{F}$ that has pse where and $\tilde{\Sigma}_{p_{\text{ID}}({\bm{x}})}=\mathbb{E}_{p_{\text{ID}}({\bm{x}})}[\tilde{{\

Figures (10)

  • Figure 1: Reliability diagrams of open-source pre-trained models with (red) and without instruction-tuning (blue) on the MMLU dataset. The horizontal axis represents the model’s confidence in each answer choice for each question, while the vertical axis shows the accuracy on each question. The solid diagonal indicates perfect calibration, separating areas where predictions are deemed over-confident or under-confident. Instruction-tuning visibly leads to over-confidence, regardless of the instruction-tuning dataset (which differs between models).
  • Figure 2: Effects of instruction-tuning on calibration, presented under a number of different calibration error metrics (where lower is better). Values can range from 0 to 100. Models are all fine-tuned on a Tulu3 SFT dataset and evaluated on MMLU. Across all models, which have various structural differences, label smoothing reduces calibration error while having a negligible effect on downstream task accuracy.
  • Figure 3: Calibration of different LLaMA3 models fine-tuned on the same SFT dataset. As the size of the model decreases, the calibration of the model sees less improvement from the use of LS.
  • Figure 4: Relative entropy bound for different LLM vocabulary sizes with varying hidden sizes ($D$). Our visualization shows the normalized entropy gap for varying hidden sizes of the LM head. This gap is calculated by taking the difference between the entropy upper and lower bounds and dividing by the upper bound ($\log\left|{\bm{V}}\right|$). A lower ratio indicates the model is restricted to producing concentrated predictions.
  • Figure 5: Effect of label smoothing on large vocabulary models with a smaller hidden size (2048). Gemma-2B sees a smaller change than LLaMA3.2-1B, owing to its having the largest vocabulary size. Gemma2-2B, however, sees a large change, in part thanks to the soft-capping of its logits.
  • ...and 5 more figures
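
The reliability diagrams and calibration error metrics in the captions above rest on comparing confidence with accuracy inside confidence bins. The sketch below computes expected calibration error (ECE) with equal-width bins in plain Python, scaled to the 0 to 100 range mentioned for Figure 2. The bin count of 10, the equal-width binning, and the name `expected_calibration_error` are assumptions made for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins, returned on a 0-100 scale.

    confidences: predicted probability of the chosen answer, each in [0, 1]
    correct:     1 if the chosen answer was right, else 0
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    # Each bin stores [count, sum of confidences, sum of correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += hit
    ece = 0.0
    for count, conf_sum, acc_sum in bins:
        if count:
            # Gap between mean accuracy and mean confidence, weighted by bin size.
            ece += (count / n) * abs(acc_sum / count - conf_sum / count)
    return 100.0 * ece


# Calibrated: 75% confidence and 3 of 4 correct gives zero error.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
# Over-confident: full confidence but only half correct.
overconfident = expected_calibration_error([1.0, 1.0], [1, 0])
```

Each bin corresponds to one bar of a reliability diagram: a bin whose mean accuracy falls below its mean confidence lies under the diagonal and signals over-confidence.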

Theorems & Definitions (12)

  • Lemma 3.1: calibrated_finetuning
  • Definition 3.2
  • Proposition 3.3
  • Proposition 3.4
  • Lemma 4.1
  • Theorem 4.2
  • Remark 4.3
  • Remark 4.4
  • proof
  • proof
  • ...and 2 more