Visual Exploration of Stopword Probabilities in Topic Models

Shuangjiang Xue, Pierre Le Bras, David A. Robb, Mike J. Chantler, Stefano Padilla

TL;DR

This work addresses stopword-removal biases in topic-model visualizations by proposing a corpus-specific probabilistic stopword estimation method based on Gaussian Process Classification (GPC), together with an interactive interface. The method builds a $\mathbf{Swdf}$ representation across topics, trains a GPC to output $P_t$, and enables threshold-driven extraction of stopwords within an integrated topic-model visualization that includes a 2-D GPC Matrix. A qualitative study with 12 experts shows that the approach improves perceived credibility, supports representative extensions of stopword lists, and lets users control decisions through an adjustable threshold, although the 2-D visualization needs a clearer explanation. The work offers practical recommendations for practitioners and suggests future integrations into existing topic-model pipelines to make stopword analysis in ML visualizations more robust and interpretable.

Abstract

Stopword removal is a critical stage in many Machine Learning methods but often receives little consideration, which interferes with model visualizations and disrupts user confidence. Inappropriately chosen or hastily omitted stopwords not only lead to suboptimal performance but also significantly affect the quality of models, reducing the willingness of practitioners and stakeholders to rely on the output visualizations. This paper proposes a novel extraction method that provides a corpus-specific probabilistic estimation of stopword likelihood, and an interactive visualization system to support the analysis of stopwords. We evaluated our approach and interface using real-world data, a commonly used Machine Learning method (Topic Modelling), and a comprehensive qualitative experiment probing user confidence. The results show that our system increases user confidence in the credibility of topic models by (1) returning reasonable probabilities, (2) generating an appropriate and representative extension of common stopword lists, and (3) providing an adjustable threshold for estimating and analyzing stopwords visually. Finally, we discuss insights, recommendations, and best practices to support practitioners while improving the output of Machine Learning methods and topic model visualizations with robust stopword analysis and removal.
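
The adjustable threshold mentioned in the abstract can be illustrated with a minimal sketch. It assumes each word already has an estimated stopword probability (in the paper these come from the GPC; here they are invented values), and that a word is flagged when its probability meets or exceeds the threshold. The function name `extract_stopwords` and the example numbers are hypothetical and are not taken from the authors' system.

```python
def extract_stopwords(stopword_prob, threshold=0.5):
    """Return the words whose estimated stopword probability is >= threshold.

    Words are ordered from most to least likely stopword (ties broken
    alphabetically). `stopword_prob` maps each word to a value in [0, 1].
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    flagged = [word for word, prob in stopword_prob.items() if prob >= threshold]
    return sorted(flagged, key=lambda word: (-stopword_prob[word], word))


# Hypothetical probabilities, invented purely for illustration.
example_probs = {"this": 0.97, "paper": 0.62, "model": 0.41, "bayesian": 0.03}

strict = extract_stopwords(example_probs, threshold=0.9)   # conservative list
lenient = extract_stopwords(example_probs, threshold=0.4)  # more aggressive list
```

Raising the threshold yields a shorter, more conservative list; lowering it also flags borderline words, which is the kind of trade-off the interface leaves to the user.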

Paper Structure

This paper contains 29 sections, 5 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Examples of the $\mathbf{Swdf}_j$ of a representative stopword "this" and a topic word "bayesian"
  • Figure 2: Comparison of the original layout with the monotonically decreasing layout. Red stands for stopwords and blue for topic words. Each rectangle shows the 64% confidence interval of a dimension, and the line in the middle marks the mean.
  • Figure 3: The GPC Matrix: a 2-D approximate visualization of the GPC model. Cells divide the problem space into a $30 \times 50$ grid. The background colour of each cell corresponds to the trained probability of a word being a topic word, given a dimension ($h$) and a document frequency ($df$). A blue background signifies a high probability of being a topic word. A red background signifies a high probability of being a stopword. Upon selecting a word (within a topic), the user sees the estimated probabilities for that word superimposed as connected black circles. The three examples show the three typical cases: a clear stopword, a borderline word and a clear topic word.
  • Figure 4: Scatter plots of the mean and standard deviation of the $\mathbf{Swdf}_j$ for topic words (top left), the $\mathbf{Swdf}_j$ for stopwords (top right), the $\mathbf{Swtf}_j$ for topic words (bottom left) and the $\mathbf{Swtf}_j$ for stopwords (bottom right)
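
The captions for Figures 2 and 4 refer to a monotonically decreasing layout and to the mean and standard deviation of each word's vector. The sketch below shows one way these quantities could be computed. It rests on an assumption that this page does not confirm: that $\mathbf{Swdf}_j$ is a vector of per-topic document frequencies for word $j$, and that the monotonically decreasing layout simply sorts those values in descending order. The helper names and the numbers are hypothetical.

```python
from statistics import mean, pstdev


def monotone_layout(per_topic_df):
    """Sort a word's per-topic values in descending order (assumed layout)."""
    return sorted(per_topic_df, reverse=True)


def summarize(vector):
    """Return the (mean, population standard deviation) of a word's vector."""
    return mean(vector), pstdev(vector)


# Hypothetical per-topic document frequencies, invented for illustration.
stopword_like = [0.82, 0.79, 0.85, 0.80]  # frequent in every topic
topic_like = [0.02, 0.70, 0.01, 0.03]     # concentrated in one topic

sorted_topic_like = monotone_layout(topic_like)
stop_mean, stop_std = summarize(stopword_like)
topic_mean, topic_std = summarize(topic_like)
```

With these invented values, the stopword-like vector has a high mean and a small spread, while the topic-like vector has a lower mean and a larger spread; whether real corpora separate this cleanly is what the scatter plots in Figure 4 are meant to show.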