AI Revolution on Chat Bot: Evidence from a Randomized Controlled Experiment

Sida Peng, Wojciech Swiatek, Allen Gao, Paul Cullivan, Haoge Chang

TL;DR

This study asks whether LLM-based tools improve real-world information retrieval support. It reports a field randomized controlled trial comparing a GPT-based bot to a flowchart-based PowerVA bot for Microsoft developers. The primary outcome is escalation to back-end engineers. The GPT-based bot substantially reduces escalation while maintaining engagement similar to the baseline. The analyses also compare GPT-3.5 and GPT-4, finding no significant difference in escalation but differences in token usage and cost. Regression analyses with HC2 standard errors confirm the main finding, and a robustness check restricted to first-session data also supports the reduction in escalation. The work demonstrates practical productivity gains from GPT-based support in a realistic enterprise setting and offers guidance on cost considerations when choosing between GPT models in field deployments.
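To make the HC2 reference concrete, here is a minimal sketch of how a treatment effect on a binary escalation outcome could be estimated. It assumes the simplest case, a regression of the outcome on a treatment indicator plus an intercept, where the HC2 standard error of the treatment coefficient coincides with the classical two-sample (Neyman) estimator. The function name and the data are hypothetical illustrations, not taken from the paper, whose actual specification may include additional covariates.

```python
from math import sqrt

def diff_in_means_hc2(treated, control):
    """Return (estimate, se) for outcome ~ intercept + treatment indicator.

    With a single binary regressor, the HC2 standard error of the
    treatment coefficient equals sqrt(s1^2/n1 + s0^2/n0), where s^2 are
    sample variances computed with n-1 denominators.
    """
    n1, n0 = len(treated), len(control)
    mean1 = sum(treated) / n1
    mean0 = sum(control) / n0
    var1 = sum((y - mean1) ** 2 for y in treated) / (n1 - 1)
    var0 = sum((y - mean0) ** 2 for y in control) / (n0 - 1)
    estimate = mean1 - mean0
    se = sqrt(var1 / n1 + var0 / n0)
    return estimate, se

# Hypothetical escalation indicators (1 = escalated to an engineer).
gpt_bot = [1] * 2 + [0] * 8    # assumed treatment arm
flow_bot = [1] * 5 + [0] * 5   # assumed control arm

estimate, se = diff_in_means_hc2(gpt_bot, flow_bot)
print(f"effect on escalation rate: {estimate:.3f} (HC2 SE {se:.3f})")
```

A negative estimate here would correspond to fewer escalations in the treatment arm; the numbers above are made up purely to exercise the function.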

Abstract

In recent years, generative AI has undergone major advancements, demonstrating significant promise in augmenting human productivity. Notably, large language models (LLMs), with ChatGPT-4 as an example, have drawn considerable attention. Numerous articles have examined the impact of LLM-based tools on human productivity in lab settings with designed tasks or in observational studies. Despite these advances, field experiments applying LLM-based tools in realistic settings remain limited. This paper presents the findings of a field randomized controlled trial assessing the effectiveness of LLM-based tools in providing unmonitored support services for information retrieval.

Paper Structure (12 sections, 11 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Power Virtual Agents: Flow Chart Based System
  • Figure 2: Microsoft PowerVA-based Support Bot Experience
  • Figure 3: GPT-based Support Bot Experience
  • Figure 4: Distributions of Token Consumption for GPT-3.5-based and GPT-4-based bots