NLP-based detection of systematic anomalies among the narratives of consumer complaints

Peiheng Gao, Ning Sun, Xuefeng Wang, Chen Yang, Ričardas Zitikis

TL;DR

The paper addresses the detection of systematic nonmeritorious consumer complaints in narrative data by coupling an NLP-based classifier with downstream anomaly detection on quantified narrative signals. It introduces two input-output systems, based on TF-IDF and TF-IDF-VADER featurizations, and analyzes them with two indices, the I-index and the B-index, to identify background risk without specifying distributions. A Cobb-Douglas-like relationship links sentiment, adjusted dollars, and word counts to produce transferable signals, enabling robust anomaly detection on the meritorious subset identified by classification. Empirically, SVM with TF-IDF achieves the strongest classification performance, while TF-IDF-VADER tends to reduce the presence of nonmeritorious cases in the meritorious set, offering practical guidance for prioritizing relief in CFPB data and similar regulatory contexts.
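
To make the TF-IDF featurization named in this summary concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' pipeline: the function name `tfidf`, the whitespace tokenizer, the smoothed IDF variant, and the toy narratives are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tfidf(narratives):
    """Return one {term: weight} dict per narrative (illustrative TF-IDF)."""
    # Naive whitespace tokenization (an assumption); a real pipeline would
    # also handle punctuation, stop words, and redaction markers.
    docs = [text.lower().split() for text in narratives]
    n_docs = len(docs)

    # Document frequency: number of narratives that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times a smoothed inverse document
            # frequency (one common variant, assumed here).
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy narratives, not taken from the CFPB database.
toy = [
    "the bank charged a late fee",
    "the bank refused to refund the fee",
    "my credit report shows an error",
]
weights = tfidf(toy)
```

Terms that occur in few narratives receive larger weights than terms shared across many, which is what lets a downstream classifier such as an SVM separate narratives by their distinctive vocabulary.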

Abstract

We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.

Paper Structure (22 sections, 34 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 4.1: Scatter-plot \ref{log CD-0sp} and the fitted least squares regression line.
  • Figure 4.2: Diagnostic plots for model \ref{log CD-0}.
  • Figure 4.3: Scatter-plot \ref{log CD-sp} and the fitted least squares regression line.
  • Figure 4.4: Diagnostic plots for model \ref{log CD-0}.
  • Figure 6.1: The plots of $I_n$ for the five classifications.
  • ...and 6 more figures