The Adoption and Efficacy of Large Language Models: Evidence From Consumer Complaints in the Financial Industry

Minkyu Shin; Jin Kim; Jiwoong Shin

TL;DR

The paper investigates whether consumers adopt large language models to draft financial complaints and whether such usage causally enhances relief outcomes. It combines observational analysis of a large CFPB dataset with an instrumental-variables approach using ZIP-code proxies for Internet access and English proficiency, and it corroborates findings with controlled lab experiments testing the mechanism of improved presentation. The results show a sharp post-ChatGPT increase in Likely-AI complaints and a positive association with relief; IV estimates suggest a potential causal effect, and lab experiments demonstrate that LLM-enhanced presentation increases relief likelihood by about $10.28$ percentage points. These findings imply that broader, equitable access to LLM tools can improve consumer advocacy and inform regulatory policy on technological accessibility in financial services.

Abstract

Large Language Models (LLMs) are reshaping consumer decision-making, particularly in communication with firms, yet our understanding of their impact remains limited. This research explores the effect of LLMs on consumer complaints submitted to the Consumer Financial Protection Bureau from 2015 to 2024, documenting the adoption of LLMs for drafting complaints and evaluating the likelihood of obtaining relief from financial firms. We analyzed over 1 million complaints and identified a significant increase in LLM usage following the release of ChatGPT. We find that LLM usage is associated with an increased likelihood of obtaining relief from financial firms. To investigate this relationship, we employ an instrumental variable approach to mitigate endogeneity concerns around LLM adoption. Although instrumental variables suggest a potential causal link, they cannot fully capture all unobserved heterogeneity. To further establish this causal relationship, we conducted controlled experiments, which support that LLMs can enhance the clarity and persuasiveness of consumer narratives, thereby increasing the likelihood of obtaining relief. Our findings suggest that facilitating access to LLMs can help firms better understand consumer concerns and level the playing field among consumers. This underscores the importance of policies promoting technological accessibility, enabling all consumers to effectively voice their concerns.
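The instrumental-variable logic mentioned in the abstract can be illustrated with a minimal two-stage least squares sketch. This is not the authors' specification: the data are simulated, the variables are continuous rather than the binary adoption and relief indicators in the paper, and the names `ols_slope`, `tsls_slope`, `z`, `u`, `x`, and `y` are hypothetical. It only shows why an instrument that shifts adoption, but affects the outcome solely through adoption, can remove bias from an unobserved confounder.

```python
import numpy as np


def ols_slope(y, x):
    """Slope from regressing y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


def tsls_slope(y, x, z):
    """Two-stage least squares with one instrument z for one regressor x."""
    # Stage 1: project the endogenous regressor on the instrument.
    Z = np.column_stack([np.ones_like(z), z])
    first_stage, *_ = np.linalg.lstsq(Z, x, rcond=None)
    x_hat = Z @ first_stage
    # Stage 2: regress the outcome on the fitted values from stage 1.
    return ols_slope(y, x_hat)


# Simulated data (an assumption for illustration, not the CFPB data).
rng = np.random.default_rng(0)
n = 20_000
true_effect = 0.5
z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)          # "adoption", driven by z and u
y = true_effect * x + u + rng.normal(size=n)  # "outcome", also driven by u

naive = ols_slope(y, x)
iv = tsls_slope(y, x, z)
print(f"OLS: {naive:.3f}  2SLS: {iv:.3f}  true: {true_effect}")
```

In this simulation the naive regression overstates the effect because `u` raises both `x` and `y`, while the two-stage estimate lands close to the true value. As the abstract notes, a real instrument cannot be guaranteed to satisfy these assumptions, which is why the paper supplements the IV analysis with controlled experiments.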

Paper Structure (49 sections, 5 equations, 10 figures, 12 tables)

Introduction
Analysis of LLM Adoption and Efficacy Using the CFPB Dataset
Datasets & AI Detection Tool
CFPB dataset
ACS dataset
AI Detection Tool
Consumer Adoption of LLMs for Writing Complaints in Financial Industry
The Impact of Using LLMs on Getting Relief from Financial Firms
Instrumental variables (IVs) estimation
Falsification Test
Controlled Experiments: Testing the Presentation-Enhancing Effect of LLM
A Summary of Three Pilot Studies
Experiment 1
Experiment 2
Conclusion and Implications
...and 34 more sections

Figures (10)

Figure 1: LLM Adoption Observed in the CFPB dataset
Figure 2: LLM Adoption Patterns Varying Across Regions Based on English proficiency
Figure 3: Enhancement in Presentation for the LLM-Edited vs. Unedited Complaints
Figure S1: Density distribution of human scores, defined as 100 - AI score, across the entire dataset. The distribution is bimodal, with one peak near 0 and another near 100, suggesting two predominant classes within the dataset. The blue vertical dotted line marks a human score threshold of 95, and the red vertical dotted line marks a threshold of 1. We classify complaints as "Likely-Human" when the human score is at least 95% (i.e., AI score $\leq$ 5%) and as "Likely-AI" when the human score is at most 1% (i.e., AI score $\geq$ 99%). The density plot thus provides a visual basis for these classification thresholds, highlighting the natural separation between the two groups of complaints.
Figure S2: Each panel in the series displays the temporal trend for "Likely-AI" complaints, defined using various threshold levels. The top panel represents cases classified as Likely-AI when the AI score is greater than or equal to 90%. The middle panel corresponds to cases with an AI score threshold of greater than or equal to 80%, and the bottom panel includes cases with AI scores greater than or equal to 70%. More false positive cases are observed with a lower threshold; that is, when the threshold is less stringent, the percentage of "Likely-AI" complaints is more likely to be greater than zero before the release of ChatGPT.
...and 5 more figures
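The threshold rule described in the Figure S1 caption can be sketched as a small function. The cutoffs (AI score $\geq$ 99% for "Likely-AI", AI score $\leq$ 5% for "Likely-Human") come from the caption; the function name `classify_complaint`, the 0 to 100 score scale as an input convention, and the "Unclassified" label for scores in between are assumptions made for illustration.

```python
def classify_complaint(ai_score: float) -> str:
    """Label a complaint from its AI-detection score on a 0-100 scale.

    Cutoffs follow the Figure S1 caption; the "Unclassified" label for
    the middle range is a hypothetical name, not taken from the paper.
    """
    if not 0 <= ai_score <= 100:
        raise ValueError("ai_score must be between 0 and 100")
    if ai_score >= 99:
        return "Likely-AI"
    if ai_score <= 5:
        return "Likely-Human"
    return "Unclassified"


labels = [classify_complaint(s) for s in (0.5, 5, 50, 98.9, 99, 100)]
print(labels)
```

Lowering the "Likely-AI" cutoff to 90, 80, or 70, as in Figure S2, would widen the first branch and, per that caption, admit more false positives before the release of ChatGPT.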
