Is Our Chatbot Telling Lies? Assessing Correctness of an LLM-based Dutch Support Chatbot

Herman Lassche, Michiel Overeem, Ayushi Rastogi

TL;DR

This work defines correctness for an LLM-based Dutch support chatbot as a combination of truthfulness, relatedness, and completeness, prioritizing truthfulness for automated evaluation and grounding it in real AFAS data. The authors construct a manual decision tree to mirror human judgment, derive heuristics from it, and implement a rule-based, feature-driven scoring system to assess answers to Binary and Instruction-type questions. The system reaches an F1 of up to 0.96 on question-type prediction and a Spearman correlation of up to 0.37 with human judgments on English translations. They demonstrate that their approach can identify wrong messages in about 55% of cases and, with practical guardrails, save substantial human effort (estimated at around 15,000 hours per year) while improving near-real-time Dutch support. The study offers a structured, transferable methodology, rooted in domain-specific heuristics and a human-in-the-loop process, that can be adapted to other domains and languages to evaluate and improve chatbot correctness and reliability.

Abstract

Companies support their customers using live chats and chatbots to gain their loyalty. AFAS is a Dutch company aiming to leverage the opportunity large language models (LLMs) offer to answer customer queries with minimal to no input from its customer support team. Two things complicate this: it is unclear what makes a response correct, and the responses are in Dutch. Further, with minimal data available for training, the challenge is to identify whether an answer generated by a large language model is correct, and to do so on the fly. This study is the first to define the correctness of a response based on how the support team at AFAS makes decisions. It leverages literature on natural language generation and automated answer grading systems to automate the decision-making of the customer support team. We investigated questions requiring a binary response (e.g., Would it be possible to adjust tax rates manually?) or instructions (e.g., How would I adjust tax rates manually?) to test how closely our automated approach matches the support team's ratings. Our approach can identify wrong messages in 55% of the cases. This work demonstrates the potential for automatically assessing when our chatbot may provide incorrect or misleading answers. Specifically, we contribute (1) a definition and metrics for assessing correctness, and (2) suggestions to improve correctness with respect to regional language and question type.

Paper Structure

This paper contains 32 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The current and desired flows for handling customer queries. In the current workflow, the support team acts as an intermediary, providing context* (relevant documents and instructions) to the large language model and later assessing its response (see 'Chatbot' and 'Large Language Model'). With the 'Google Translator' and 'Automated Rating' parts, we envision replacing human feedback with automated ratings
  • Figure 2: A snippet of the decision tree, indicating whether an answer would be true or not
  • Figure 3: Total message-answer pairs by mistake type. The x-axis shows the message type, and the y-axis shows the number of messages per type
  • Figure 4: Illustration of the scoring process, where scores range from 1 to 5. A score of 1 is assigned if any component is missing, and 5 if a guide in the answer matches one in the context. Intermediate scores are obtained by summing the outputs of the other metrics
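
The scoring rule summarized in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class and field names are hypothetical, and the caption does not say how the summed metric outputs map onto the scale or which rule wins when both apply. The sketch assumes integer metric outputs, clamps their sum to the 1 to 5 range, and checks for missing components first.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AnswerFeatures:
    """Hypothetical feature bundle; all field names are illustrative assumptions."""
    has_all_components: bool          # False if any required component is missing
    guide_matches_context: bool       # True if a guide in the answer matches one in the context
    metric_outputs: Sequence[int] = ()  # outputs of the remaining metrics (assumed integers)


def score_answer(features: AnswerFeatures) -> int:
    """Return a score from 1 to 5 following the caption's three rules."""
    # Rule 1: any missing component gives the lowest score.
    # Assumption: this check takes precedence over the guide match.
    if not features.has_all_components:
        return 1
    # Rule 2: a guide in the answer that matches one in the context gives the top score.
    if features.guide_matches_context:
        return 5
    # Rule 3: otherwise sum the outputs of the other metrics.
    # Assumption: the sum is clamped so the result stays on the 1 to 5 scale.
    total = sum(features.metric_outputs)
    return max(1, min(5, total))


if __name__ == "__main__":
    example = AnswerFeatures(
        has_all_components=True,
        guide_matches_context=False,
        metric_outputs=(1, 2),
    )
    print(score_answer(example))
```

Keeping the two boundary rules as early returns makes the precedence explicit, so the assumed ordering is easy to swap if the paper's full text specifies a different one.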