AI Revolution on Chat Bot: Evidence from a Randomized Controlled Experiment
Sida Peng, Wojciech Swiatek, Allen Gao, Paul Cullivan, Haoge Chang
TL;DR
This study asks whether LLM-based tools improve real-world information-retrieval support. It reports a field randomized controlled trial comparing a GPT-based bot with a flowchart-based PowerVA bot serving Microsoft developers. The primary outcome is escalation to back-end engineers: the GPT-based bot substantially reduces escalation while maintaining engagement similar to the baseline. A comparison of GPT-3.5 and GPT-4 finds no significant difference in escalation, but the two models differ in token usage and cost. Regression analyses with HC2 standard errors confirm the main finding, and a robustness check restricted to first-session data also supports the escalation reduction. The work demonstrates practical productivity gains from GPT-based support in a realistic enterprise setting and offers guidance on cost considerations when choosing among GPT models for field deployment.
Abstract
In recent years, generative AI has undergone major advancements, demonstrating significant promise in augmenting human productivity. Notably, large language models (LLMs), with ChatGPT-4 as an example, have drawn considerable attention. Numerous articles have examined the impact of LLM-based tools on human productivity, either in lab settings with designed tasks or in observational studies. Despite these advances, field experiments applying LLM-based tools in realistic settings remain limited. This paper presents the findings of a field randomized controlled trial assessing the effectiveness of LLM-based tools in providing unmonitored support services for information retrieval.
