Uncovering Customer Issues through Topological Natural Language Analysis

Shu-Ting Pi, Sidarth Srinivasan, Yuying Zhu, Michael Yang, Qun Liu

TL;DR

This work addresses the challenge of extracting emerging and trending customer issues from massive, unlabeled transcript data by integrating a sentence-level attention model with topological data analysis. The method first tags the primary customer question and produces sentence embeddings, which are whitened to an isotropic space before forming an undirected similarity graph across time windows. Centrality-based measures, including matched and mismatched decay centralities, identify topics that are growing or shifting over time, yielding trending and emerging scores. Validation against human annotations and external signals (forums and news) demonstrates that the approach captures meaningful business-relevant topics and is robust to hyperparameter choices, enabling rapid, data-driven operational insights.

Abstract

E-commerce companies deal with a high volume of customer service requests daily. While a simple annotation system is often used to summarize the topics of customer contacts, thoroughly exploring each specific issue can be challenging. This presents a critical concern, especially during an emerging outbreak where companies must quickly identify and address specific issues. To tackle this challenge, we propose a novel machine learning algorithm that leverages natural language techniques and topological data analysis to monitor emerging and trending customer issues. Our approach involves an end-to-end deep learning framework that simultaneously tags the primary question sentence of each customer's transcript and generates sentence embedding vectors. We then whiten the embedding vectors and use them to construct an undirected graph. From there, we define trending and emerging issues based on the topological properties of each transcript. We have validated our results through various methods and found that they are highly consistent with news sources.
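
The whitening and graph-construction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`whiten`, `similarity_graph`, `degree_centrality`) are hypothetical, whitening is done with a plain eigendecomposition of the covariance matrix, edges are created by a cosine-similarity threshold `alpha`, and ordinary degree centrality stands in for the decay centralities used in the paper.

```python
import numpy as np


def whiten(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Center the embeddings and rescale them so their covariance is ~identity."""
    mu = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mu
    cov = centered.T @ centered / embeddings.shape[0]
    # The covariance matrix is symmetric, so eigh returns real eigenpairs.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Divide each eigenvector (column) by the square root of its eigenvalue.
    # eps is an assumed small constant that guards against division by zero.
    transform = eigvecs / np.sqrt(eigvals + eps)
    return centered @ transform


def similarity_graph(vectors: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Boolean adjacency matrix: edge (i, j) iff cosine similarity >= alpha."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    adj = sim >= alpha
    np.fill_diagonal(adj, False)  # no self-loops
    return adj


def degree_centrality(adj: np.ndarray) -> np.ndarray:
    """Fraction of other nodes each node is connected to.

    Stand-in for the paper's decay centralities, which are not reproduced here.
    """
    n = adj.shape[0]
    return adj.sum(axis=1) / max(n - 1, 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(50, 8))  # placeholder data, not real transcripts
    graph = similarity_graph(whiten(fake_embeddings), alpha=0.6)
    print(degree_centrality(graph))
```

With this stand-in measure, lowering `alpha` can only add edges, so each node's degree centrality stays the same or increases as the threshold is relaxed.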

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Our proposed workflow involves several steps. Initially, the transcripts are passed to a sentence attention model to extract the primary questions asked by the customers and their corresponding sentence embeddings. The embeddings are then whitened to obtain representations in an isotropic coordinate system. These whitened vectors are then utilized to construct an undirected graph, and their topological properties are calculated to identify both trending and emerging issues.
  • Figure 1: An illustration of how the question tagging model works. The model calculates an attention score, $\sigma_{i}$, for each sentence in a transcript. Among the sentences preceding and following the agent's first sentence, the one with the highest score (the highlighted sentence) is predicted as the customer's primary question.
  • Figure 2: The Sentence Attention Model. Blue blocks represent tensors and green blocks represent operators. (a) The neural network consists of sentence tensors, $S_{i}$, and a sequence model, $\Sigma$, which outputs $Q^{'}_{i}$. The position embedding vector $E_{i}$ is combined with $Q^{'}_{i}$ to create the sentence embeddings $Q_{i}$. Finally, a linear classifier predicts the product/service. (b) Detail of the red block in (a). Tensor notation ($S_1;T_n$) refers to the $n$-th token in sentence $S_1$. The sequence model $\Sigma$ processes each word in a sentence, using a time-distributed wrapper to handle multiple sentences. (c) Detail of the orange block in (a). A dense layer, $K$, with a softmax activation function is applied to all sentence embedding vectors $Q_{i}$ (via a time-distributed wrapper) to calculate the attention scores $\sigma_{i}$. Note that $Q_{i}$ is equivalent to $V_{i}$.
  • Figure 3: Panels (a)-(f) show the centrality distribution for Fire Tablet in Feb 2023, comparing cosine similarity thresholds of $\alpha = 0.8$ and $\alpha = 0.6$. Panel (g) shows a graph built from a few selected samples: similar sentences cluster together, giving nodes near the cluster centers high centrality.
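
The attention scoring described in the captions above (a dense layer with a softmax producing one score $\sigma_{i}$ per sentence, with the top-scoring candidate taken as the primary question) can be sketched in a few lines. This is an illustrative sketch only, not the model from the paper: the dense layer $K$ is reduced to a single hypothetical weight vector `k_weights` plus a bias, the sentence embeddings are plain Python lists, and the candidate sentences are assumed to be supplied by the caller as `candidate_idx`.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(
    sentence_embeddings: Sequence[Sequence[float]],
    k_weights: Sequence[float],
    k_bias: float = 0.0,
) -> List[float]:
    """One score per sentence: softmax over (k_weights . Q_i + k_bias)."""
    logits = [
        sum(w * q for w, q in zip(k_weights, q_i)) + k_bias
        for q_i in sentence_embeddings
    ]
    return softmax(logits)


def pick_primary_question(scores: Sequence[float], candidate_idx: Sequence[int]) -> int:
    """Index of the highest-scoring sentence among the allowed candidates."""
    return max(candidate_idx, key=lambda i: scores[i])


if __name__ == "__main__":
    # Toy embeddings for three sentences (made-up numbers, for illustration only).
    Q = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
    sigma = attention_scores(Q, k_weights=[1.0, 1.0])
    print(sigma, pick_primary_question(sigma, candidate_idx=[0, 1]))
```

Restricting the argmax to `candidate_idx` mirrors the caption's idea that only sentences around the agent's first sentence are eligible; how that window is defined in practice is left as an assumption here.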