BotEval: Facilitating Interactive Human Evaluation

Hyundong Cho, Thamme Gowda, Yuyang Huang, Zixun Lu, Tianli Tong, Jonathan May

TL;DR

BotEval addresses the need for robust evaluation of interactive NLP systems by enabling real-time human-bot interactions within an open-source, configurable toolkit. It provides a modular web application with an evaluation interface, an administrator dashboard, and plug-in bot customization, plus built-in integrations with the crowdsourcing platforms Amazon Mechanical Turk (AMT) and Prolific, and YAML-based task configuration. The authors demonstrate its utility through a case study on conversational moderation, showing how multi-turn interaction and the evaluator's point of view influence evaluation outcomes and how prompt design affects performance. The toolkit offers a practical foundation for scalable, interactive evaluation of NLP agents, with templates and deployment options that ease adoption in research and crowdsourcing contexts.

Abstract

Following rapid progress in natural language processing (NLP), language models are being applied to increasingly complex interactive tasks such as negotiation and conversational moderation. Having human evaluators interact directly with these models is essential for adequately evaluating performance on such tasks. We develop BotEval, an easily customizable, open-source evaluation toolkit that focuses on enabling human-bot interactions as part of the evaluation process, as opposed to having human evaluators make judgements on static inputs. BotEval balances flexibility for customization with user-friendliness by providing templates for common use cases that span various degrees of complexity, along with built-in compatibility with popular crowdsourcing platforms. We showcase BotEval's features through a study that evaluates the effectiveness of various chatbots at conversational moderation, and we discuss how BotEval differs from other annotation tools.

Paper Structure

This paper contains 19 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: A snapshot of the admin point of view of an evaluation interface with a completed evaluation example. The interface is identical for the evaluator except for the text that shows the evaluator's worker ID (hidden with asterisks in the figure for privacy). The three main components of the user-facing interface are the conversation pane, simple instruction pane, and the survey pane.
  • Figure 2: A snapshot of the topics page of the admin dashboard. The page has three labeled components: a parallel management tool that enables setting global configurations, such as how many tasks each evaluator is allowed to complete, and launching or deleting multiple tasks at once; a topics table that shows more information about each topic, such as its name, how many tasks have been created, and when they were created; and a list of parameters that can be chosen when launching a task, including parameters that can be passed on to API queries for the bots.
  • Figure 3: BotEval system architecture. We use popular frameworks that are well documented and easy to use.
  • Figure 4: An example of the survey pane configuration that contains a custom Likert scale and freeform text input fields. This configuration corresponds to the survey pane partially shown in Figure 1.
  • Figure 5: An example of configuring the consent form. The agreement_file parameter should point to the HTML file that shows the content of the consent form.
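
The consent-form setting described for Figure 5 can be checked before a task is launched. The following is a minimal sketch, not BotEval's own code: only the agreement_file parameter name comes from the caption above, while the flat dictionary layout, the helper name check_consent_config, and the file name consent.html are hypothetical and used purely for illustration.

```python
from pathlib import Path
import tempfile


def check_consent_config(config: dict, base_dir: Path) -> Path:
    """Resolve and sanity-check the consent form named by `agreement_file`.

    Only the `agreement_file` key is taken from the Figure 5 caption; the flat
    dict layout and this helper are hypothetical, not part of BotEval.
    """
    agreement = config.get("agreement_file")
    if not agreement:
        raise ValueError("consent config has no 'agreement_file' entry")

    # Interpret a relative path as relative to the configuration directory.
    path = (base_dir / agreement).resolve()
    if path.suffix.lower() not in {".html", ".htm"}:
        raise ValueError(f"agreement_file should be an HTML file: {path.name}")
    if not path.is_file():
        raise ValueError(f"agreement_file does not exist: {path}")
    return path


if __name__ == "__main__":
    # Demonstration with a throwaway directory and a placeholder consent form.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "consent.html").write_text("<p>Consent form text</p>", encoding="utf-8")
        resolved = check_consent_config({"agreement_file": "consent.html"}, base)
        print(resolved.name)
```

A check of this kind would catch a missing or mistyped consent-form path before evaluators reach the task, instead of surfacing it as a broken page during a crowdsourcing run.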