Moving Beyond LDA: A Comparison of Unsupervised Topic Modelling Techniques for Qualitative Data Analysis of Online Communities

Amandeep Kaur, James R. Wallace

TL;DR

This work addresses the barrier that qualitative researchers face in applying topic modelling to large social media corpora. It evaluates three unsupervised techniques (LDA, NMF, and BERTopic) by integrating BERTopic into the Computational Thematic Analysis Toolkit and conducting interviews with qualitative researchers. Results show that BERTopic delivers superior topic coherence and diversity and is better at revealing hidden relationships, at the price of higher computational cost and more complex navigation; researchers nevertheless valued its granularity and interpretability. The study demonstrates the potential of LLM-based methods to enhance qualitative analysis workflows and provides design guidance for usable, ethical, and explainable tooling, with future work focusing on hierarchical visualizations and broader, longitudinal evaluations.

Abstract

Social media constitutes a rich and influential source of information for qualitative researchers. Although computational techniques like topic modelling assist with managing the volume and diversity of social media content, qualitative researchers' lack of programming expertise creates a significant barrier to their adoption. In this paper we explore how BERTopic, an advanced Large Language Model (LLM)-based topic modelling technique, can support qualitative data analysis of social media. We conducted interviews and hands-on evaluations in which qualitative researchers compared topics from three modelling techniques: LDA, NMF, and BERTopic. BERTopic was favoured by 8 of 12 participants for its ability to provide detailed, coherent clusters for deeper understanding and actionable insights. Participants also prioritised topic relevance, logical organisation, and the capacity to reveal unexpected relationships within the data. Our findings underscore the potential of LLM-based techniques for supporting qualitative analysis.
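
Topic diversity, one of the measures the summary above mentions, can be illustrated with a minimal sketch. This excerpt does not say how the paper computes it, so the code below assumes one common definition: the fraction of unique words among the top-k keywords of all topics. The function name `topic_diversity` and the keyword lists are hypothetical and invented for illustration only; they are not results from the study.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k keywords of all topics.

    Assumed definition (not confirmed by the paper excerpt):
    values near 1.0 mean topics share few keywords,
    values near 0.0 mean topics are largely redundant.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical keyword lists, invented for illustration only.
overlapping = [
    ["model", "data", "learning", "paper"],
    ["model", "data", "training", "paper"],
]
distinct = [
    ["model", "data", "learning", "paper"],
    ["gpu", "memory", "cuda", "driver"],
]

print(topic_diversity(overlapping, top_k=4))  # 5 unique words out of 8
print(topic_diversity(distinct, top_k=4))     # 8 unique words out of 8
```

A score like this only captures keyword overlap; it says nothing about whether the topics are meaningful to a human reader, which is why the study pairs quantitative measures with researcher interviews.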

Paper Structure

This paper contains 47 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: LDA topics and keywords. topic_id is the topic number, topic_keywords lists the keywords generated by the model, and the other quantitative results show the characteristics of that specific topic in the corpora.
  • Figure 2: Top2Vec topics and keywords. ${\pi}$ depicts the topic number, and the keywords are those generated by the model; by default, it gives the first 50 words associated with each topic.
  • Figure 3: BERTopic topics and keywords. Topic 6 is the topic number, the string words are the keywords generated by the model, and the numeric values give the probability of each keyword; by default, it gives the first 10 words associated with each topic.
  • Figure 4: Modeling and Sampling Module in the CTA Toolkit with BERTopic Integration: Model Details, a Topic List with keywords, and a Sample List of entries on the left side, and a Chord Graph with Topics as Word Clouds on the right. This layout facilitates detailed and comprehensive analysis of topic modeling results.
  • Figure 5: Chord diagram for LDA on the r/MachineLearning dataset, highlighting critiques of irrelevant symbols and the presence of mathematical symbols.
  • ...and 2 more figures