
LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses

Xin Liu, Muhammad Khalifa, Lu Wang

TL;DR

This work tackles LM hallucinations by proposing LitCab, a lightweight calibration layer placed on top of the LM that adds <2% of the original parameters and uses a max-margin objective to adjust the final-layer logits. To evaluate calibration across outputs of varying lengths, the authors introduce CaT, a benchmark spanning phrase-, sentence-, and paragraph-level generations, and use it to compare multiple open-source LMs. Empirical results show LitCab consistently improves calibration over traditional post-processing and training-based baselines, and often outperforms LM confidence estimators, with compatibility observed for short-form tasks and limited data efficiency. Key findings reveal nuanced relationships between model scale, calibration, and task length: larger models within a family can be better calibrated on short outputs but not on longer ones, and instruction-tuning can sometimes degrade calibration.

Abstract

A model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations of LMs as well as building more trustworthy models. However, standard calibration techniques may not be suited for LM calibration. For instance, post-processing methods such as temperature scaling do not reorder the candidate generations. On the other hand, training-based methods require fine-tuning the entire model, which is impractical for LMs of large scale. We present LitCab, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term, which is then added to the LM output logits. LitCab improves model calibration while adding only < 2% of the original model parameters. For evaluation, we construct CaT, a benchmark consisting of eight text generation tasks, covering responses ranging from short phrases to paragraphs. We test LitCab with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%. We further conduct a comprehensive evaluation with multiple popular open-source LMs from the GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration on tasks with short generations, but not necessarily on tasks with longer ones. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii) Fine-tuning a pretrained model (e.g., LLaMA) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of fine-tuning setups for calibrating LMs.
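The mechanism described in the abstract (a single linear layer that maps the text representation to a bias over the output logits) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy dimensions, and the hand-picked weights are all assumptions made for illustration, and the training step is omitted entirely. The sketch only shows why an input-dependent additive bias can reorder candidates, which a single global temperature cannot do.

```python
import math

def linear_bias(hidden, weight):
    """Compute weight @ hidden: one bias value per vocabulary entry.

    `weight` stands in for the added linear layer (hypothetical toy values)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def calibrated_probs(hidden, lm_logits, weight):
    """Add the predicted bias to the LM logits, then normalize."""
    bias = linear_bias(hidden, weight)
    adjusted = [l + b for l, b in zip(lm_logits, bias)]
    return softmax(adjusted)

# Toy setup (assumed): hidden size 2, vocabulary size 3.
hidden = [1.0, -1.0]
lm_logits = [2.0, 1.0, 0.0]

# A zero layer leaves the LM distribution untouched.
zero_weight = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
base = calibrated_probs(hidden, lm_logits, zero_weight)

# A non-zero layer adds a bias of +2.0 to token 1 for this input,
# so token 1 overtakes token 0.
weight = [[0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]
adjusted = calibrated_probs(hidden, lm_logits, weight)

print(base.index(max(base)), adjusted.index(max(adjusted)))
```

Because the bias depends on the hidden state, a different input would shift the logits differently. Dividing all logits by one positive temperature, by contrast, preserves their order.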

Paper Structure (29 sections, 2 equations, 4 figures, 4 tables)

This paper contains 29 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Calibration techniques for neural models. LitCab combines the benefits of computational efficiency with great flexibility.
  • Figure 2: The 4-step process for evaluating calibration over paragraph-level generations by breaking the text down into individual claims and then estimating confidence and judging correctness for each claim separately. Step 1: Individual claims are extracted using GPT-3.5-turbo. Step 2: The extracted claims are mapped back to the corresponding spans in the paragraph. Step 3: The confidence of each claim is estimated by aggregating probabilities over tokens in the corresponding span. Step 4: The correctness of each claim is determined by GPT-4, based on whether the claim is supported by the retrieved Wikipedia passages.
  • Figure 3: Left: The process of constructing positive and negative samples. Right: LitCab training. LitCab adjusts the logits that the LM predicts from the last layer's hidden states, with parameters trained using a max-margin objective.
  • Figure 4: Bar charts of averaged acc@50, ECE, and Brier score of popular LMs computed on CaT. The results for GPT-2 XL in paragraph-level tasks are missing due to the prompt's length exceeding its context limit. Bars with the same color represent models from the same model family.
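
The quantities named in the captions above (per-claim confidence from token probabilities, ECE, and Brier score) can be sketched as follows. This is a hedged illustration rather than the paper's evaluation code. ECE and Brier score follow their standard definitions, but the choice of 10 equal-width bins is an assumption, and the geometric mean used in `claim_confidence` is an assumed aggregation, since the caption only says that token probabilities are aggregated. All function names are hypothetical.

```python
import math

def claim_confidence(token_probs):
    """Aggregate token probabilities over a claim's span.

    ASSUMPTION: geometric mean; the source does not specify the aggregation."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE with equal-width bins (bin count is an assumption).

    Weighted average over bins of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

def brier_score(confidences, correct):
    """Mean squared difference between confidence and 0/1 correctness."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

# Toy data (made up for illustration): four answers with confidences and 0/1 labels.
confs = [0.9, 0.9, 0.1, 0.1]
labels = [1, 1, 0, 0]
print(expected_calibration_error(confs, labels), brier_score(confs, labels))
```

Lower is better for both metrics. ECE measures the gap between stated confidence and observed accuracy within each bin, while the Brier score also penalizes confidences that are uninformative.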