Confidence Estimation for LLM-Based Dialogue State Tracking

Yi-Jyun Sun, Suvodip Dey, Dilek Hakkani-Tur, Gokhan Tur

TL;DR

This work studies confidence estimation for LLM-based dialogue state tracking (DST) in task-oriented dialogue systems, addressing reliability and hallucination risk. It compares open-weight and closed-weight approaches, introducing verbalized confidence and self-probing prompts, and combines multiple signals via a linear model to calibrate slot-value confidences. Fine-tuning open-weight models on MultiWOZ improves both DST performance and confidence calibration, with the combined confidence score achieving strong calibration (e.g., $\mathrm{AUC} \approx 0.725$, $\mathrm{ECE} \approx 0.018$) and competitive DST accuracy. The results suggest practical deployment guidance, highlighting turn-level self-probing as an efficient calibration method and planning future work to integrate confidence into dialogue policy decisions and broader datasets.
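
The two ideas in this summary, a linear combination of confidence signals and the expected calibration error (ECE), can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the signal names, the weights, the bias, and the equal-width binning scheme are hypothetical and are not taken from the paper, which would fit the linear model on held-out data.

```python
def combine_confidence(signals, weights, bias=0.0):
    """Linear combination of per-slot confidence signals, clipped to [0, 1].

    `signals` and `weights` are dicts keyed by signal name. The weights are
    hypothetical placeholders; in practice they would be fitted by regression.
    """
    raw = bias + sum(weights[name] * value for name, value in signals.items())
    return min(1.0, max(0.0, raw))


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: the size-weighted average of
    |mean confidence - accuracy| over all non-empty bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical signals for one predicted slot-value pair.
    signals = {"softmax": 0.8, "raw": 0.6, "verbalized": 0.9}
    weights = {"softmax": 0.5, "raw": 0.3, "verbalized": 0.2}  # made-up values
    print(combine_confidence(signals, weights))
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower ECE means the stated confidences track observed accuracy more closely, which is why the summary reports it alongside AUC.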

Abstract

Estimation of a model's confidence in its outputs is critical for conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. We also enhance these with a self-probing mechanism, proposed for closed models. Furthermore, we assess these methods using an open-weight model fine-tuned for the task of DST, achieving superior joint goal accuracy (JGA). Our findings also suggest that fine-tuning open-weight LLMs can result in enhanced AUC performance, indicating better confidence score calibration.
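
As a rough illustration of the softmax-based score and the AUC evaluation mentioned in the abstract, the sketch below turns per-token logits into a confidence for a generated value and then scores a set of confidences against correctness labels. It rests on two assumptions that the abstract does not state: that the confidence for a multi-token value is the product of the chosen tokens' softmax probabilities, and that AUC means the area under the ROC curve. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one decoding step's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_confidence(step_logits, chosen_ids):
    """Confidence of a generated value, taken here (an assumption) as the
    product of the softmax probabilities of the tokens that were chosen."""
    confidence = 1.0
    for logits, token_id in zip(step_logits, chosen_ids):
        confidence *= softmax(logits)[token_id]
    return confidence


def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison: the probability that a correct
    prediction receives a higher score than an incorrect one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Two decoding steps over a toy two-token vocabulary.
    print(sequence_confidence([[0.0, 0.0], [0.0, 0.0]], [0, 1]))
    # Confidences for four slot predictions, with 1 = correct and 0 = incorrect.
    print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

An AUC near 1.0 means correct slot values are consistently assigned higher confidence than incorrect ones, while 0.5 corresponds to a score that carries no ranking information.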

Paper Structure

This paper contains 38 sections, 7 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Example interaction in our TODS approach, showing DST and its outputs with confidence scores.
  • Figure 2: An example demonstrating the individual and combined confidence scores.