Mitigating Label Length Bias in Large Language Models

Mario Sanz-Guerrero, Katharina von der Wense

TL;DR

This work identifies label length bias as a critical shortcoming in LLM-based classification: because a label's probability is the joint probability of its tokens, multi-token labels are systematically disadvantaged. It introduces Normalized Contextual Calibration (NCC), which first normalizes full-label probabilities by label length using a geometric mean and then calibrates them with priors estimated from content-free inputs, producing calibrated full-label distributions. Across eight multi-token-label datasets and multiple model families, NCC yields absolute gains of up to 10% in macro-F1 and more reliable predictions, reduces sensitivity to few-shot example selection, and extends to multiple-choice QA tasks. The work shows that addressing full-label biases is essential for robust, real-world LLM applications, especially where class labels are multi-token expressions, and it notes limitations related to the need for access to token-level probabilities and to open-ended outputs.

Abstract

Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
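The two steps named in the TL;DR (geometric-mean length normalization, then calibration against a content-free prior) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the choice to calibrate by dividing by the prior and renormalizing, the token counts per label, and all probabilities below are assumptions made up for illustration.

```python
import math


def length_normalized_prob(token_logprobs):
    # Geometric mean of the per-token probabilities: exp(mean of log-probs).
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def ncc_distribution(input_logprobs, content_free_logprobs):
    # Both arguments map label -> list of per-token log-probs, scored once
    # on the real input and once on a content-free input.
    scores = {}
    for label, token_logprobs in input_logprobs.items():
        p = length_normalized_prob(token_logprobs)
        prior = length_normalized_prob(content_free_logprobs[label])
        # Assumption: calibrate by dividing by the prior; the paper's exact
        # formula is not given in this summary.
        scores[label] = p / prior
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def logs(probs):
    # Helper for the toy example: convert probabilities to log-probs.
    return [math.log(p) for p in probs]


# Hypothetical per-token probabilities: one token for the short label,
# three tokens for the long one (real tokenization depends on the model).
input_logprobs = {
    "Sports": logs([0.2]),
    "Science & Mathematics": logs([0.4, 0.6, 0.5]),
}
content_free_logprobs = {
    "Sports": logs([0.3]),
    "Science & Mathematics": logs([0.3, 0.5, 0.4]),
}

# Baseline: raw joint probability, which shrinks with every extra token.
joint = {label: math.exp(sum(lp)) for label, lp in input_logprobs.items()}
raw_choice = max(joint, key=joint.get)

ncc = ncc_distribution(input_logprobs, content_free_logprobs)
ncc_choice = max(ncc, key=ncc.get)

print("raw joint prediction:", raw_choice)
print("normalized + calibrated prediction:", ncc_choice)
```

In this toy setup the raw joint probability favors the single-token label, while the normalized and calibrated scores favor the three-token label, which is the kind of shift the summary attributes to NCC.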

Paper Structure

This paper contains 33 sections, 4 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: NCC enables the use of calibration in multi-token tasks, mitigating label biases of LLMs and improving their performance. The figure shows real numbers obtained with the Llama 3.1 (8B) model.
  • Figure 2: Prediction frequency by label using zero-shot Llama 3.1 (8B) on the Yahoo dataset [agnews_dbpedia_yahoo_dataset].
  • Figure 3: Few-shot performance of methods and models, averaged over all datasets. Error bars indicate the standard deviation across runs. Numbers above the bars indicate absolute improvement of NCC over the second-best method.
  • Figure 4: Zero-shot performance of methods and models, averaged over all datasets. Numbers indicate absolute improvement of NCC over the second-best method.
  • Figure 5: Average performance of all methods in the zero-shot and few-shot setting. Numbers indicate relative improvement when going from 0 to 5 shots.
  • ...and 7 more figures