End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

Nhi Dang, Tung Le, Huy Tien Nguyen

TL;DR

This work proposes an end-to-end automatic evaluator that makes chatbot evaluation practical and scalable while substantially reducing the need for manual review.

Abstract

Large language models (LLMs) combined with retrieval-augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.

Paper Structure

This paper contains 18 sections, 2 equations, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Overall system pipeline. The framework consists of three main components: (1) automatic test data generation, where LLMs generate questions and expected answers from an article database and query the target chatbot for responses; (2) LLM-based evaluation, where another LLM judges the correctness of responses and outputs both labels and verbal confidence scores; and (3) uncertainty quantification, which computes an aggregated confidence and filters out low-confidence samples for human review.
  • Figure 2: Automatic Test Data Generation Pipeline. Given an article from the input database, an LLM is prompted to generate n pairs, each consisting of a question and its expected answer. The questions are used to query the target chatbot, while the expected answers serve as ground truth for later evaluation.
  • Figure 3: Single Prompt Evaluation Method. The LLM receives a question, expected answer, and chatbot-generated answer, then directly returns a label (TRUE, FALSE, or NOT GIVEN) based on a predefined prompt template.
  • Figure 4: Sequential Decision Evaluation Method. The LLM evaluates the received answer in a step-by-step manner. It first checks whether the answer refuses to respond, then compares it with the expected answer to classify the content as incorrect, equivalent, missing, or excessive. If additional or missing information is detected, the model decides whether it changes the core meaning. The final output is a label: TRUE, FALSE, or NOT GIVEN.
  • Figure 5: Adaptive K-step Reasoning Evaluation Method. The LLM is prompted to evaluate the received answer compared to the expected answer by reasoning through up to K self-defined steps. At each step, it provides a judgment, an explanation, and a confidence score. The final output includes the label (TRUE, FALSE, or NOT GIVEN), the overall confidence (0–1), and a rationale for the decision.
  • ...and 1 more figure
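
The decision flow described for Figure 4 and the filtering stage described for Figure 1 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the names `sequential_label`, `Judgment`, `aggregate_confidence`, and `split_for_review` are hypothetical, the mapping from the intermediate checks to the final label is one plausible reading of the caption, the aggregated confidence is taken to be the plain mean of per-step confidences, and the 0.8 threshold is arbitrary. In the described system the checks themselves would come from LLM calls; here they are passed in as plain values.

```python
from dataclasses import dataclass
from statistics import mean


def sequential_label(refused: bool, relation: str,
                     changes_core_meaning: bool = False) -> str:
    """Map step-by-step checks to a final label (assumed mapping)."""
    if refused:
        # The chatbot declined to answer.
        return "NOT GIVEN"
    if relation == "incorrect":
        return "FALSE"
    if relation == "equivalent":
        return "TRUE"
    if relation in ("missing", "excessive"):
        # Extra or missing detail only matters if it alters the core meaning.
        return "FALSE" if changes_core_meaning else "TRUE"
    raise ValueError(f"unknown relation: {relation!r}")


@dataclass
class Judgment:
    question: str
    label: str                     # TRUE, FALSE, or NOT GIVEN
    step_confidences: list[float]  # one score in [0, 1] per reasoning step


def aggregate_confidence(judgment: Judgment) -> float:
    """Aggregate per-step confidences; a plain mean is assumed here."""
    return mean(judgment.step_confidences)


def split_for_review(judgments: list[Judgment], threshold: float = 0.8):
    """Separate confident judgments from those routed to human review."""
    accepted, needs_review = [], []
    for j in judgments:
        if aggregate_confidence(j) >= threshold:
            accepted.append(j)
        else:
            needs_review.append(j)
    return accepted, needs_review


if __name__ == "__main__":
    judgments = [
        Judgment("q1", sequential_label(False, "equivalent"), [0.9, 0.95]),
        Judgment("q2", sequential_label(False, "missing", True), [0.5, 0.7]),
    ]
    accepted, needs_review = split_for_review(judgments)
    print("accepted:", [j.question for j in accepted])
    print("needs review:", [j.question for j in needs_review])
```

Under these assumptions, only the judgments whose aggregated confidence falls below the threshold reach a human reviewer, which is how confidence-based filtering would lower review overhead.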