Visual Exploration of Stopword Probabilities in Topic Models
Shuangjiang Xue, Pierre Le Bras, David A. Robb, Mike J. Chantler, Stefano Padilla
TL;DR
This work tackles the bias that stopword removal introduces into topic-model visualizations by proposing a corpus-specific probabilistic stopword estimation using Gaussian Process Classification (GPC) and an interactive interface. The method builds a Swdf representation across topics, trains a GPC to output Pt, and enables threshold-driven extraction of stopwords within an integrated topic-model visualization, including a 2-D GPC Matrix. A qualitative study with 12 experts shows that the approach improves perceived credibility, supports representative extensions of stopword lists, and allows user-controlled decision-making via an adjustable threshold, although the 2-D visualization needs a clearer explanation. The work offers practical recommendations for practitioners and suggests future integration into existing topic-model pipelines to make stopword analysis in ML visualizations more robust and interpretable.
Abstract
Stopword removal is a critical stage in many Machine Learning methods but often receives little consideration, even though it interferes with model visualizations and disrupts user confidence. Inappropriately chosen or hastily omitted stopwords not only lead to suboptimal performance but also significantly affect the quality of models, reducing the willingness of practitioners and stakeholders to rely on the output visualizations. This paper proposes a novel extraction method that provides a corpus-specific probabilistic estimation of stopword likelihood, together with an interactive visualization system to support the analysis of stopwords. We evaluated our approach and interface using real-world data, a commonly used Machine Learning method (Topic Modelling), and a comprehensive qualitative experiment probing user confidence. The results show that our system increases user confidence in the credibility of topic models by (1) returning reasonable probabilities, (2) generating an appropriate and representative extension of common stopword lists, and (3) providing an adjustable threshold for estimating and analyzing stopwords visually. Finally, we discuss insights, recommendations, and best practices to support practitioners while improving the output of Machine Learning methods and topic model visualizations with robust stopword analysis and removal.
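To make the adjustable threshold concrete, the following is a minimal sketch of threshold-driven extraction, not the authors' implementation. It assumes that a per-word stopword probability has already been estimated (by the GPC in this work); the function name `extend_stopwords`, the base list, and the probability values are hypothetical and chosen only for illustration.

```python
def extend_stopwords(base_stopwords, word_probs, threshold=0.5):
    """Extend a base stopword list with words whose estimated
    stopword probability is at or above the threshold.

    base_stopwords: iterable of words from a common stopword list
    word_probs:     dict mapping word -> estimated stopword probability
                    (assumed to come from an upstream classifier)
    threshold:      user-adjustable cut-off in [0, 1]
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    # Keep every word whose probability meets the cut-off.
    extracted = {word for word, prob in word_probs.items() if prob >= threshold}
    # The result is an extension of the base list, never a replacement.
    return set(base_stopwords) | extracted


# Illustrative values only; these are not results from the paper.
toy_probs = {"the": 0.98, "study": 0.81, "results": 0.64, "kernel": 0.12}
base = {"the", "of"}

strict = extend_stopwords(base, toy_probs, threshold=0.9)
lenient = extend_stopwords(base, toy_probs, threshold=0.6)
```

Lowering the threshold can only add words to the extended list, which is what lets a user explore the trade-off interactively: in this toy example, the lenient setting adds "study" and "results" while the strict setting leaves the base list unchanged.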
