Topic Modelling: Going Beyond Token Outputs

Lowri Williams, Eirini Anthi, Laura Arman, Pete Burnap

TL;DR

The paper tackles the interpretability gap in token-based topic modelling outputs by introducing a privacy-conscious, internal-data approach that extends topics with high-scoring keywords drawn from the same corpus. The workflow maps documents to dominant topics, applies RAKE to extract topic-specific keywords, and maps these keywords to the model’s token outputs to create cohesive descriptor phrases. Human-subject experiments show that extended descriptors improve interpretability (higher quality and usefulness) and annotation efficiency compared to traditional LDA outputs, with strong inter-annotator agreement. The method generalises to state-of-the-art models (BERTopic, Top2Vec) and remains effective on unseen data (20 Newsgroups), offering a lightweight, real-time, privacy-preserving enhancement to topic interpretation with practical benefits for information retrieval and decision support.

Abstract

Topic modelling is a text mining technique for identifying salient themes in a collection of documents. The output is commonly a set of topics, each consisting of isolated tokens that often co-occur in those documents. Interpreting a topic's description from such tokens usually requires manual effort. However, from a human's perspective, such outputs may not provide enough information to infer the meaning of the topics; thus, topics are often interpreted inaccurately. Although several studies have attempted to automatically extend topic descriptions as a means of enhancing the interpretation of topic models, they rely on external language sources that may become unavailable, must be kept up-to-date to generate relevant results, and present privacy issues when training on or processing data. This paper presents a novel approach towards extending the output of traditional topic modelling methods beyond a list of isolated tokens. This approach removes the dependence on external sources by using the textual data itself: it extracts high-scoring keywords and maps them to the topic model's token outputs. To measure the interpretability of the proposed outputs against those of the traditional topic modelling approach, independent annotators manually scored each output based on its quality and usefulness, as well as the efficiency of the annotation task. The proposed approach demonstrated higher quality and usefulness, as well as higher efficiency in the annotation task, in comparison to the outputs of a traditional topic modelling method, indicating an increase in interpretability.
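
The keyword-extension workflow summarised above (assign each document to its dominant topic, extract high-scoring keywords per topic with RAKE, then map those keywords onto the topic's tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the RAKE scoring below is a simplified pure-Python version (degree/frequency word scores summed per phrase), the stopword list is deliberately tiny, and the documents, `dominant_topic` assignments, `topic_tokens`, and the helper names `candidate_phrases`, `rake_scores`, and `extend_topics` are all hypothetical. In practice the assignments and tokens would come from a fitted topic model such as LDA.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real RAKE uses a far larger one).
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
             "in", "is", "it", "of", "on", "or", "that", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords (RAKE-style)."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):  # punctuation ends a phrase
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(texts):
    """Score each phrase as the sum of its words' degree/frequency ratios."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    phrases = [p for t in texts for p in candidate_phrases(t)]
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences in the phrase, itself included
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)}


def extend_topics(docs, dominant_topic, topic_tokens, top_n=3):
    """Attach the highest-scoring multi-word keywords to each topic token."""
    by_topic = defaultdict(list)
    for doc, topic in zip(docs, dominant_topic):
        by_topic[topic].append(doc)
    extended = {}
    for topic, tokens in topic_tokens.items():
        scores = rake_scores(by_topic[topic])
        extended[topic] = {}
        for token in tokens:
            matches = [(s, p) for p, s in scores.items()
                       if token in p.split() and len(p.split()) > 1]
            matches.sort(key=lambda sp: (-sp[0], sp[1]))  # high score first, then A-Z
            extended[topic][token] = [p for _, p in matches[:top_n]]
    return extended


# Hypothetical toy inputs standing in for a fitted topic model's outputs.
docs = [
    "The central bank raised the interest rate to slow inflation.",
    "Mortgage payments rise with a higher interest rate.",
    "The football club signed a striker on the transfer deadline.",
]
dominant_topic = [0, 0, 1]                          # dominant topic id per document
topic_tokens = {0: ["rate", "bank"], 1: ["club"]}   # top tokens per topic

extended = extend_topics(docs, dominant_topic, topic_tokens)
for topic, descriptors in extended.items():
    print(topic, descriptors)
```

Because keywords are scored only within the documents assigned to a topic, each token is extended with phrases from its own topical context, and no external language resource is consulted.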

Paper Structure

This paper contains 11 sections, 17 figures, 5 tables.

Figures (17)

  • Figure 1: An overview of the study design
  • Figure 2: Extended topic outputs for all datasets
  • Figure 3: Bespoke annotation platform
  • Figure 4: Distribution of annotations across the quality of extended outputs
  • Figure 5: Distribution of annotations across the usefulness of extended outputs
  • ...and 12 more figures