LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models
Elias Stengel-Eskin, Peter Hase, Mohit Bansal
TL;DR
The paper addresses overconfident LLM outputs by introducing LACIE, a listener-aware finetuning framework that optimizes for how a listener would judge an answer, not only for whether the answer is correct. The authors generate preference data with a two-agent game, in which a simulated listener judges a speaker model's outputs, and then finetune the speaker with Direct Preference Optimization. Across three models (Mistral-7B, Llama3-8B, Llama3-70B), LACIE improves calibration with respect to the simulated listener, with large gains in AUROC and precision. The gains carry over to human listeners, who accept 47% fewer incorrect answers while accepting correct answers at the same rate, and to out-of-domain data: a model trained on TriviaQA becomes markedly more truthful on TruthfulQA. LACIE-trained models also abstain more often on answers that are likely wrong. Together, the results suggest that pragmatics and listener modeling are practical tools for calibrating model confidence.
Abstract
When answering questions, LLMs can convey not only an answer, but a level of confidence about the answer being correct. This includes explicit confidence markers (e.g. giving a numeric score) as well as implicit markers, like an authoritative tone or elaborating with additional knowledge. For LLMs to be trustworthy knowledge sources, the confidence they convey should match their actual expertise; however, most current models tend towards overconfidence. To calibrate both implicit and explicit confidence markers, we introduce a pragmatic, listener-aware finetuning method (LACIE) that models the listener, considering not only whether an answer is right, but whether it will be accepted by a listener. We cast calibration as preference optimization, creating data via a two-agent game, where a speaker model's outputs are judged by a simulated listener. We then finetune three LLMs (Mistral-7B, Llama3-8B, Llama3-70B) with LACIE, and show that the resulting models are better calibrated w.r.t. a simulated listener. Crucially, these trends transfer to human listeners, helping them correctly predict model correctness: we conduct a human evaluation where annotators accept or reject an LLM's answers, finding that training with LACIE results in 47% fewer incorrect answers being accepted while maintaining the same level of acceptance for correct answers. Furthermore, LACIE generalizes to another dataset, resulting in a large increase in truthfulness on TruthfulQA when trained on TriviaQA. Our analysis indicates that LACIE leads to a better confidence separation between correct and incorrect examples. Qualitatively, we find that a LACIE-trained model hedges more and implicitly signals certainty when it is correct by using an authoritative tone or including details. Finally, LACIE finetuning leads to an emergent increase in model abstention (e.g. saying "I don't know") for answers that are likely wrong.
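To make the "calibration as preference optimization" framing concrete, the following is a minimal sketch of how speaker outputs judged by a listener could be turned into preference pairs. It is an illustration under stated assumptions, not the authors' implementation: the `JudgedAnswer` type, the `calibration_score` rule, the `margin` threshold, and the example answers are all hypothetical, and the abstract does not specify the exact preference function the paper uses. The assumed rule rewards an answer when the listener's acceptance probability agrees with the answer's correctness, so a confidently stated wrong answer ranks below a hedged wrong answer.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class JudgedAnswer:
    """One speaker output plus the listener's judgment (hypothetical type)."""
    text: str
    is_correct: bool             # ground-truth correctness of the answer
    listener_accept_prob: float  # simulated listener's probability of accepting it


def calibration_score(answer: JudgedAnswer) -> float:
    """Assumed scoring rule: high when acceptance agrees with correctness."""
    if answer.is_correct:
        return answer.listener_accept_prob
    return 1.0 - answer.listener_accept_prob


def build_preference_pairs(question, answers, margin=0.2):
    """Pair up answers to one question; keep pairs whose score gap is clear.

    Returns dicts in a prompt/chosen/rejected layout, a common shape for
    preference-optimization data. The margin value is an assumption.
    """
    pairs = []
    for a, b in combinations(answers, 2):
        gap = calibration_score(a) - calibration_score(b)
        if abs(gap) < margin:
            continue  # too close to call, so no preference is recorded
        chosen, rejected = (a, b) if gap > 0 else (b, a)
        pairs.append(
            {"prompt": question, "chosen": chosen.text, "rejected": rejected.text}
        )
    return pairs


# Made-up example: one correct answer and two incorrect ones.
answers = [
    JudgedAnswer("It is definitely Paris.", True, 0.9),
    JudgedAnswer("It is definitely Lyon.", False, 0.9),
    JudgedAnswer("I'm not sure, but possibly Lyon.", False, 0.2),
]
pairs = build_preference_pairs("What is the capital of France?", answers)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

In this toy example the overconfident wrong answer is rejected twice: once against the correct answer and once against the hedged wrong answer. That second pair is the kind of signal that would push a speaker to hedge when it is likely wrong. Pairs of this shape could then be passed to a preference-optimization trainer such as DPO.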
