End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

Nhi Dang; Tung Le; Huy Tien Nguyen

TL;DR

This work proposes an end-to-end automatic evaluator that makes chatbot evaluation practical and scalable while substantially reducing the need for manual review.

Abstract

Large language models (LLMs) combined with retrieval-augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly, and existing frameworks often depend on curated test sets and static metrics, which limits their scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to flag uncertain cases for human review. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.

Paper Structure (18 sections, 2 equations, 6 figures, 1 table)

Introduction
Related work
Automatic test data generation with LLMs
LLMs as automatic judges of model outputs
Uncertainty quantification and Filtering in LLM evaluations
Evaluation Frameworks
Proposed method
Automatic test data generation
LLM-Based automatic evaluation
Single Prompt
Sequential Decision
Adaptive K-step Reasoning
Uncertainty quantification
Experiment setup
Evaluation and Analysis
...and 3 more sections

Figures (6)

Figure 1: Overall system pipeline. The framework consists of three main components: (1) automatic test data generation, where LLMs generate questions and expected answers from an article database and query the target chatbot for responses; (2) LLM-based evaluation, where another LLM judges the correctness of responses and outputs both labels and verbal confidence scores; and (3) uncertainty quantification, which computes an aggregated confidence and filters out low-confidence samples for human review.
Figure 2: Automatic Test Data Generation Pipeline. Given an article from the input database, an LLM is prompted to generate n pairs of question and expected answer. The questions are used to query the target chatbot, while the expected answers serve as ground truth for later evaluation.
Figure 3: Single Prompt Evaluation Method. The LLM receives a question, expected answer, and chatbot-generated answer, then directly returns a label (TRUE, FALSE, or NOT GIVEN) based on a predefined prompt template.
Figure 4: Sequential Decision Evaluation Method. The LLM evaluates the received answer in a step-by-step manner. It first checks whether the answer refuses to respond, then compares it with the expected answer to classify the content as incorrect, equivalent, missing, or excessive. If additional or missing information is detected, the model decides whether it changes the core meaning. The final output is a label: TRUE, FALSE, or NOT GIVEN.
Figure 5: Adaptive K-step Reasoning Evaluation Method. The LLM is prompted to evaluate the received answer compared to the expected answer by reasoning through up to K self-defined steps. At each step, it provides a judgment, an explanation, and a confidence score. The final output includes the label (TRUE, FALSE, or NOT GIVEN), the overall confidence (0–1), and a rationale for the decision.
...and 1 more figure
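
The uncertainty-filtering step described in Figures 1 and 5 can be illustrated with a minimal sketch. It assumes each judgment carries per-step verbal confidences in [0, 1]. The aggregation rule (a plain mean), the threshold of 0.8, and the names `Judgment`, `aggregate_confidence`, and `split_by_confidence` are hypothetical choices made for illustration only; the paper's own aggregation formula is not reproduced on this page.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class Judgment:
    """One LLM verdict on a chatbot answer (hypothetical structure)."""
    question: str
    label: str                     # "TRUE", "FALSE", or "NOT GIVEN"
    step_confidences: List[float]  # verbal confidence per reasoning step, in [0, 1]


def aggregate_confidence(judgment: Judgment) -> float:
    # Assumption: a plain mean over steps; the paper may aggregate differently.
    return mean(judgment.step_confidences)


def split_by_confidence(
    judgments: List[Judgment], threshold: float = 0.8
) -> Tuple[List[Judgment], List[Judgment]]:
    """Separate verdicts that are accepted automatically from those sent to human review."""
    accepted: List[Judgment] = []
    needs_review: List[Judgment] = []
    for judgment in judgments:
        if aggregate_confidence(judgment) >= threshold:
            accepted.append(judgment)
        else:
            needs_review.append(judgment)
    return accepted, needs_review


# Illustrative inputs only; these are not results from the paper.
judgments = [
    Judgment("q1", "TRUE", [0.9, 0.95, 0.9]),
    Judgment("q2", "FALSE", [0.6, 0.7]),
    Judgment("q3", "NOT GIVEN", [1.0]),
]
accepted, needs_review = split_by_confidence(judgments)
```

With these made-up inputs, `q1` and `q3` are accepted automatically and `q2` is routed to a human reviewer. Raising the threshold sends more cases to review, which trades reviewer effort for a lower risk of accepting a wrong automatic verdict.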
