Confidence Estimation for LLM-Based Dialogue State Tracking
Yi-Jyun Sun, Suvodip Dey, Dilek Hakkani-Tur, Gokhan Tur
TL;DR
This work studies confidence estimation for LLM-based dialogue state tracking (DST) in task-oriented dialogue systems, addressing reliability and hallucination risk. It compares confidence estimation approaches for open-weight and closed-weight models, covering verbalized confidence and self-probing prompts, and combines multiple signals via a linear model to calibrate slot-value confidences. Fine-tuning open-weight models on MultiWOZ improves both DST performance and confidence calibration, with the combined confidence score achieving strong calibration (e.g., $\mathrm{AUC} \approx 0.725$, $\mathrm{ECE} \approx 0.018$) and competitive DST accuracy. The results point to turn-level self-probing as an efficient calibration method. Planned future work includes integrating confidence into dialogue policy decisions and evaluating on broader datasets.
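For readers unfamiliar with the ECE figure quoted above, the following is a minimal sketch of expected calibration error using equal-width confidence bins. The function name, the choice of 10 bins, and the toy inputs are assumptions made for illustration; the paper's exact binning scheme is not stated in this summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins.

    confidences: floats in [0, 1], one per predicted slot-value pair.
    correct: 0/1 flags marking whether each prediction was right.
    n_bins: number of equal-width bins (an assumed default, not from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, flag in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, flag))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(f for _, f in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy data: four predictions at 0.9 confidence (three correct), one at 0.1 (wrong).
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.1], [1, 1, 1, 0, 0]))
```

A lower value means that stated confidence tracks observed accuracy more closely; a perfectly calibrated set of predictions scores 0.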
Abstract
Estimation of a model's confidence in its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential for handling uncertainty and thereby improving model performance. We evaluate four methods for estimating confidence scores: softmax, raw token scores, verbalized confidences, and a combination of these. We use the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these methods with a self-probing mechanism, originally proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
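As a rough illustration of the softmax-based and combined scores mentioned in the abstract, the sketch below computes a slot-value confidence as the product of the softmax probabilities of its generated tokens, then mixes several signals with a linear model. All function names, the toy logits, and the weights are hypothetical; in practice the weights would be fit on held-out data, and the paper's exact formulation may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one decoding step's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(step_logits, token_ids):
    """Product of softmax probabilities of the generated tokens.

    step_logits: one list of vocabulary logits per generated token.
    token_ids: index of the token actually generated at each step.
    Log-probabilities are summed to avoid underflow on long spans.
    """
    log_prob = 0.0
    for logits, token_id in zip(step_logits, token_ids):
        log_prob += math.log(softmax(logits)[token_id])
    return math.exp(log_prob)


def combine(signals, weights, bias=0.0):
    """Linear combination of confidence signals, clipped to [0, 1].

    The weights and bias are placeholders (assumed), not values from the paper.
    """
    score = bias + sum(w * s for w, s in zip(weights, signals))
    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    # Two-token slot value over a toy two-word vocabulary with uniform logits.
    softmax_conf = sequence_confidence([[0.0, 0.0], [0.0, 0.0]], [0, 1])
    verbalized_conf = 0.8  # hypothetical self-reported confidence from the model
    print(combine([softmax_conf, verbalized_conf], weights=[0.5, 0.5]))
```

The clipping step keeps the combined score interpretable as a confidence; a logistic link would be a natural alternative if the linear output frequently leaves the unit interval.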
