Calibrated Language Models and How to Find Them with Label Smoothing

Jerry Huang; Peng Lu; Qiuhao Zeng

TL;DR

This work investigates how supervised fine-tuning (SFT) degrades calibration in open-source large language models (LLMs) and proposes label smoothing (LS) as a practical remedy. It analyzes LS theoretically and empirically, showing that LS generally improves calibration but becomes less effective for large-vocabulary LLMs (LV-LLMs) with relatively small hidden sizes because of entropy constraints; it also shows how temperature scaling and logit capping can recover the benefits of LS in those cases. To address practical barriers, the authors design memory-efficient GPU kernels for smoothed cross-entropy, enabling LS for LV-LLMs with minimal memory overhead and near-competitive speed. Empirically, LS with a modest smoothing factor ($\beta\approx0.1$) improves calibration, lowering expected calibration error (ECE) and RMS-CE, on MMLU, HellaSwag, and ARC-Easy for models such as LLaMA3-8B and Mistral-7B; the results also highlight the need to account for model size and vocabulary size when choosing a calibration strategy. Overall, the paper contributes theoretical insight into LS for LLM calibration and a practical kernel-based way to scale LS to large vocabularies, improving reliability without sacrificing accuracy.
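For readers unfamiliar with the ECE metric mentioned above, the following is a minimal sketch of how it is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The function name, the 10-bin default, and the half-open binning convention are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width bins over [0, 1] (illustrative, not the paper's code)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of predictions per bin

    for c, hit in zip(confidences, correct):
        # Bin k covers [k / n_bins, (k + 1) / n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if hit else 0.0
        count[idx] += 1

    ece = 0.0
    for cs, hs, cnt in zip(conf_sum, hit_sum, count):
        if cnt == 0:
            continue
        accuracy = hs / cnt
        mean_conf = cs / cnt
        ece += (cnt / n) * abs(accuracy - mean_conf)
    return ece
```

A model that answers four questions with confidence 0.9 but gets only three right is overconfident by 0.15 under this measure, which is the kind of gap that calibration methods aim to shrink.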

Abstract

Recent advances in natural language processing (NLP) have opened up greater opportunities for fine-tuned large language models (LLMs) to behave as more powerful interactive agents through improved instruction-following ability. However, how this affects confidence calibration, and therefore the reliability of model output, has not been fully studied. In this work, we examine various open-source LLMs and identify significant calibration degradation after instruction tuning in each. Seeking a practical solution, we look towards label smoothing, which has been shown to be an effective regularizer against overconfident predictions but has yet to be widely adopted in the supervised fine-tuning (SFT) of LLMs. We first provide insight into why label smoothing is sufficient to maintain calibration throughout the SFT process. However, settings remain where the effectiveness of smoothing is severely diminished, in particular the case of large-vocabulary LLMs (LV-LLMs). We posit that the cause stems from the model's ability to become overconfident, which is directly related to the hidden size and vocabulary size, and we justify this theoretically and experimentally. Finally, we address an outstanding issue regarding the memory footprint of the cross-entropy loss computation in the label-smoothed setting, designing a customized kernel that dramatically reduces memory consumption without sacrificing speed or performance in comparison to existing solutions for non-smoothed losses.
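To make the label-smoothed loss concrete, here is a small sketch for a single token, assuming the common uniform-smoothing convention in which the target distribution is $(1-\beta)$ times the one-hot label plus $\beta/V$ on every vocabulary entry (the paper's exact convention may differ). It also shows a standard algebraic identity, $\ell = \mathrm{LSE}(z) - (1-\beta)\,z_y - \tfrac{\beta}{V}\sum_k z_k$, under which the loss needs only three scalar reductions over the logits rather than an explicit $V$-sized target or probability vector. This is not the authors' GPU kernel; the function names are hypothetical and the code only illustrates why a memory-lean formulation is possible.

```python
import math
from typing import Sequence


def log_sum_exp(logits: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(z))) via max subtraction."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def smoothed_ce_reference(logits: Sequence[float], target: int, beta: float = 0.1) -> float:
    """Reference version: builds the smoothed target q explicitly."""
    v = len(logits)
    lse = log_sum_exp(logits)
    loss = 0.0
    for k, z in enumerate(logits):
        # Assumed convention: q_k = beta / V, plus (1 - beta) on the true label.
        q = beta / v + ((1.0 - beta) if k == target else 0.0)
        loss -= q * (z - lse)  # z - lse is log p_k
    return loss


def smoothed_ce_reduced(logits: Sequence[float], target: int, beta: float = 0.1) -> float:
    """Same loss from three scalars: logsumexp, the target logit, and the logit sum."""
    v = len(logits)
    lse = log_sum_exp(logits)
    return lse - (1.0 - beta) * logits[target] - (beta / v) * sum(logits)


if __name__ == "__main__":
    z = [2.0, -1.0, 0.5, 3.0]
    print(smoothed_ce_reference(z, target=3), smoothed_ce_reduced(z, target=3))
```

The two functions agree up to floating-point error, and setting `beta=0.0` recovers the ordinary negative log-likelihood. A real implementation would apply the same reductions blockwise over the vocabulary dimension on the GPU, which is where the memory savings would come from.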
