Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents

Abdellah Ghassel, Xianzhi Li, Xiaodan Zhu

TL;DR

This work addresses the reliability challenges of LLM-driven dialogue with a Detect, Explain, Escalate framework that pairs a compact, fine-tuned monitor with frontier-model prompting in a hierarchical, cost-aware deployment. An 8B Llama-3.1 model is trained with teacher-generated reasoning traces, and larger models are evaluated under several prompting strategies: zero-shot (ZS), chain-of-thought (CoT), few-shot (FS) variants, analogical reasoning (AR), and CL+AR. The system detects breakdowns, explains them, and escalates to a larger model only when needed to restore coherence. Evaluation on DBDC5 (English and Japanese) and BETOLD shows state-of-the-art performance for several prompting setups, strong calibration, and inference-cost reductions of up to 54% relative to always-on large-model reasoning. The results indicate that resource-aware, interpretable monitoring is practical for robust conversational AI, and they point to future work on a finer-grained breakdown taxonomy and improved multilingual adaptation.

Abstract

Large Language Models (LLMs) have demonstrated substantial capabilities in conversational AI applications, yet their susceptibility to dialogue breakdowns poses significant challenges to deployment reliability and user trust. This paper introduces a "Detect, Explain, Escalate" framework to manage dialogue breakdowns in LLM-powered agents, emphasizing resource-efficient operation. Our approach integrates two key strategies: (1) We fine-tune a compact 8B-parameter model, augmented with teacher-generated reasoning traces, which serves as an efficient real-time breakdown detector and explainer. This model demonstrates robust classification and calibration on English and Japanese dialogues, and generalizes to the BETOLD dataset, improving accuracy by 7% over its baseline. (2) We systematically evaluate frontier LLMs using advanced prompting (few-shot, chain-of-thought, analogical reasoning) for high-fidelity breakdown assessment. These are integrated into an "escalation" architecture where our efficient detector defers to larger models only when necessary, substantially reducing operational costs and computational overhead. Our fine-tuned model and prompting strategies achieve state-of-the-art performance on DBDC5 and strong results on BETOLD, outperforming specialized classifiers on DBDC5 and narrowing the performance gap to larger proprietary models. The proposed monitor-escalate pipeline reduces inference costs by 54%, providing a cost-effective and interpretable solution for robust conversational AI in high-impact domains. Code and models will be publicly released.
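
To make the cost claim concrete, the following is a simplified accounting sketch, not the paper's own cost model. The symbols $c_m$ (per-turn cost of the 8B monitor), $c_L$ (per-turn cost of the large model), and $r$ (fraction of turns escalated) are assumptions introduced here for illustration. If every turn passes through the monitor and only a fraction $r$ is escalated, the expected per-turn cost and the saving relative to always calling the large model are:

$$
C_{\text{pipeline}} = c_m + r\,c_L, \qquad
\text{saving} = 1 - \frac{C_{\text{pipeline}}}{c_L} = 1 - r - \frac{c_m}{c_L}.
$$

Under this simplified model, a 54% reduction would correspond to $r + c_m / c_L = 0.46$. The abstract does not report $r$, $c_m$, or $c_L$ separately, so this only shows how the escalation rate and the monitor-to-large-model cost ratio would trade off.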

Paper Structure

This paper contains 37 sections, 16 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Sample Zero-Shot Prompt for DBDC5
  • Figure 2: Real-time Response Correction Architecture. The dialogue disruption monitor intercepts potentially unsafe assistant responses, triggering a correction from a superior model before presenting the response to the user.
  • Figure 3: Sensitivity to the escalation threshold. As $T$ increases, safety (breakdown recall) and cost (escalation rate) trace a Pareto-like frontier. The flat segment for $T \le 0.5$ reflects decisive Non-Breakdown predictions; the rising segment for $T>0.5$ shows that computation can be traded for additional safety.
  • Figure 4: BETOLD: CL+AR Prompt
  • Figure 5: Error Analysis: GPT-4 Analogical Reasoning Example on BETOLD
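
The threshold behaviour described in the Figure 3 caption can be illustrated with a minimal sketch. The escalation rule below is an assumption chosen to be consistent with that caption (flat for $T \le 0.5$, rising for $T > 0.5$): a turn is escalated if the monitor predicts a breakdown, or if it predicts Non-Breakdown with confidence below $T$. The names `MonitorOutput`, `should_escalate`, and `sweep` are hypothetical, and the toy data is made up purely to exercise the code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MonitorOutput:
    """Hypothetical monitor output for one assistant turn."""
    label: str         # "breakdown" or "non_breakdown"
    confidence: float  # monitor's probability for its predicted label


def should_escalate(out: MonitorOutput, threshold: float) -> bool:
    """Assumed rule: escalate predicted breakdowns and low-confidence
    Non-Breakdown predictions (confidence below the threshold T)."""
    if out.label == "breakdown":
        return True
    return out.confidence < threshold


def sweep(
    outputs: Sequence[MonitorOutput],
    gold_breakdown: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold, return (T, escalation rate, breakdown recall).

    Recall here is the share of true breakdowns that get escalated,
    i.e. routed to the larger model.
    """
    rows = []
    n_breakdowns = sum(gold_breakdown)
    for t in thresholds:
        flags = [should_escalate(o, t) for o in outputs]
        escalation_rate = sum(flags) / len(flags)
        caught = sum(1 for f, g in zip(flags, gold_breakdown) if f and g)
        recall = caught / n_breakdowns if n_breakdowns else 0.0
        rows.append((t, escalation_rate, recall))
    return rows


# Toy data, invented for illustration only.
toy_outputs = [
    MonitorOutput("breakdown", 0.90),
    MonitorOutput("non_breakdown", 0.95),
    MonitorOutput("non_breakdown", 0.60),  # a missed breakdown, low confidence
    MonitorOutput("non_breakdown", 0.55),
]
toy_gold = [True, False, True, False]

results = sweep(toy_outputs, toy_gold, thresholds=[0.3, 0.5, 0.7])
```

In a binary setting the confidence of a predicted label is at least 0.5, so under this rule any $T \le 0.5$ escalates only predicted breakdowns, which matches the flat segment in the caption. Raising $T$ above 0.5 escalates more turns and can recover breakdowns the monitor missed, at higher cost.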