LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

Elias Stengel-Eskin, Peter Hase, Mohit Bansal

TL;DR

The paper tackles the problem of overconfident and sometimes untruthful outputs from large language models by introducing LACIE, a listener-aware finetuning framework that optimizes for how a hypothetical listener would judge an answer. By constructing a two-agent data generation loop and applying Direct Preference Optimization to induce calibrated listener responses, the authors show substantial improvements in induced listener calibration across multiple models and even human evaluators. Key findings include large gains in AUROC and precision, reduced false accepts by humans, and robust transfer to out-of-domain data such as TruthfulQA, accompanied by emergent abstention behavior on uncertain items. The work advances safer and more trustworthy AI-assisted information seeking and demonstrates the practical value of incorporating pragmatics and listener modeling into calibrating model confidence.

Abstract

When answering questions, LLMs can convey not only an answer, but a level of confidence about the answer being correct. This includes explicit confidence markers (e.g. giving a numeric score) as well as implicit markers, like an authoritative tone or elaborating with additional knowledge. For LLMs to be trustworthy knowledge sources, the confidence they convey should match their actual expertise; however, most current models tend towards overconfidence. To calibrate both implicit and explicit confidence markers, we introduce a pragmatic, listener-aware finetuning method (LACIE) that models the listener, considering not only whether an answer is right, but whether it will be accepted by a listener. We cast calibration as preference optimization, creating data via a two-agent game, where a speaker model's outputs are judged by a simulated listener. We then finetune three LLMs (Mistral-7B, Llama3-8B, Llama3-70B) with LACIE, and show that the resulting models are better calibrated w.r.t. a simulated listener. Crucially, these trends transfer to human listeners, helping them correctly predict model correctness: we conduct a human evaluation where annotators accept or reject an LLM's answers, finding that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers. Furthermore, LACIE generalizes to another dataset, resulting in a large increase in truthfulness on TruthfulQA when trained on TriviaQA. Our analysis indicates that LACIE leads to a better confidence separation between correct and incorrect examples. Qualitatively, we find that a LACIE-trained model hedges more and implicitly signals certainty when it is correct by using an authoritative tone or including details. Finally, LACIE finetuning leads to an emergent increase in model abstention (e.g. saying "I don't know") for answers that are likely wrong.
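The two-agent data generation step described above can be illustrated with a minimal sketch. It assumes each sampled speaker response has already been labelled for correctness (against gold answers) and for listener acceptance, and it follows the reward pattern stated in the Figure 1 caption: true accepts and true rejects are preferred over false accepts and false rejects. The `Response` class, the `score` function with its +1/-1 values, and the `prompt`/`chosen`/`rejected` record layout are hypothetical choices made for illustration, not the authors' implementation; the paper's actual preference function may rank the four outcomes differently.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Response:
    text: str
    correct: bool   # speaker answer matches a gold answer
    accepted: bool  # simulated listener accepted the answer


def score(r: Response) -> int:
    """Assumed scoring: +1 for a true accept or true reject, -1 for a
    false accept or false reject."""
    return 1 if r.correct == r.accepted else -1


def build_preference_pairs(question: str, responses: list[Response]) -> list[dict]:
    """Pair each higher-scoring response with each lower-scoring one."""
    pairs = []
    for a, b in combinations(responses, 2):
        if score(a) == score(b):
            continue  # equal scores carry no preference signal
        chosen, rejected = (a, b) if score(a) > score(b) else (b, a)
        pairs.append(
            {"prompt": question, "chosen": chosen.text, "rejected": rejected.text}
        )
    return pairs
```

Records in this shape could then be passed to a preference-optimization trainer such as a DPO implementation. Under this assumed scoring, a confidently stated wrong answer that the listener accepts always lands on the rejected side of a pair.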

Paper Structure (39 sections, 1 equation, 6 figures, 8 tables)


Figures (6)

  • Figure 1: (A) A non-expert listener (who does not know the answer to the question already) accepts or rejects answers based on how confident they sound. This confidence is influenced by implicit and explicit markers. (B) To calibrate a speaker model's confidence, we train a listener-aware speaker model by bootstrapping data from a base speaker model. For each training question, we generate $k$ diverse responses. These are scored for correctness against the gold answers and accepted or rejected by a listener model. Our preference function rewards true accepts and true rejects and penalizes false accepts and false rejects. (C) Before training, models tend to be confident regardless of whether they are right or wrong. After training, listener-aware models are more confident when they are correct and less confident when they are wrong.
  • Figure 2: Induced listener probabilities for LACIE-trained and baseline models (Mistral-7B). Baselines have similar scores for correct and incorrect examples; LACIE results in significantly lower scores for incorrect answers.
  • Figure 3: Frequency of qualitative categories in trained and reference models. LACIE training results in more hedging and abstaining for incorrect examples and more detailed answers for correct ones.
  • Figure 4: Precision and AUROC as the size of the training data increases. LACIE generally continues improving with more data.
  • Figure 5: Annotation instructions.
  • ...and 1 more figures
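
The precision and AUROC mentioned in the captions above can be sketched from per-answer listener acceptance probabilities and correctness labels. This is a plain-Python illustration under stated assumptions: the function names, the 0.5 acceptance threshold, and the pairwise (rank-based) AUROC formulation are choices made here, and the paper's exact evaluation procedure is not given in this excerpt.

```python
def auroc(probs: list[float], labels: list[bool]) -> float:
    """Probability that a correct answer gets a higher listener acceptance
    probability than an incorrect one (ties count as half)."""
    pos = [p for p, y in zip(probs, labels) if y]
    neg = [p for p, y in zip(probs, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def precision(probs: list[float], labels: list[bool], threshold: float = 0.5) -> float:
    """Fraction of listener-accepted answers that are actually correct.
    The acceptance threshold is an assumption of this sketch."""
    accepted = [y for p, y in zip(probs, labels) if p >= threshold]
    return sum(accepted) / len(accepted) if accepted else 0.0
```

A speaker that sounds equally confident on right and wrong answers yields an AUROC near 0.5, while one whose incorrect answers receive clearly lower listener probabilities (the pattern Figure 2 describes for LACIE) pushes the AUROC toward 1.0.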