
Topic Modeling with Fine-tuning LLMs and Bag of Sentences

Johannes Schneider

TL;DR

The paper tackles the challenge of leveraging fine-tuned LLM encoders for topic modeling without labeled data by introducing FT-Topic, an unsupervised fine-tuning approach using sentence-group triplet losses, and SenClu, a fast Bag-of-Sentences topic model with hard topic assignments and EM-based inference. SenClu represents topics and sentence groups as continuous vectors and uses cosine similarity to drive topic assignments, while a derived word-topic scoring scheme emphasizes informative terms. Empirical results across multiple datasets show that FT-Topic enhances topic-modeling quality, and SenClu achieves state-of-the-art PMI coherence and NMI, often outperforming strong baselines with substantially faster inference than competitors like TopClus. The work demonstrates a practical, adaptable pipeline that combines unsupervised fine-tuning with BoS-based clustering to deliver high-quality, interpretable topics suitable for downstream tasks and user guidance, while outlining clear paths for further improvement.

Abstract

Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to improve LLMs considerably. The challenge lies in obtaining a suitable (labeled) dataset for fine-tuning. In this paper, we build on the recent idea of using a bag of sentences as the elementary unit in computing topics. In turn, we derive an approach, FT-Topic, to perform unsupervised fine-tuning, relying primarily on two steps for constructing a training dataset in an automatic fashion. First, a heuristic method identifies pairs of sentence groups that are assumed to be of either the same or different topics. Second, we remove sentence pairs that are likely labeled incorrectly. The dataset is then used to fine-tune an encoder LLM, which can be leveraged by any topic modeling approach using embeddings. In this work, we demonstrate its effectiveness by deriving a novel state-of-the-art topic modeling method called SenClu, which achieves fast inference through an expectation-maximization algorithm and hard assignments of sentence groups to a single topic, while giving users the possibility to encode prior knowledge on the topic-document distribution. Code is at https://github.com/JohnTailor/FT-Topic
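
The hard-assignment idea can be illustrated with a minimal sketch. It is not the SenClu implementation: it assumes sentence-group embeddings are already available as plain vectors, omits the prior on the topic-document distribution, and updates each topic vector as the re-normalized mean of its assigned groups (essentially spherical k-means). The function names `assign`, `update`, and `hard_em` are hypothetical and chosen only for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def assign(groups, topics):
    """Hard assignment: each sentence-group vector goes to its most similar topic."""
    return [max(range(len(topics)), key=lambda t: cosine(g, topics[t]))
            for g in groups]


def update(groups, labels, topics):
    """Assumed update rule: topic vector = normalized mean of its assigned groups."""
    new_topics = []
    for t, old in enumerate(topics):
        members = [g for g, label in zip(groups, labels) if label == t]
        if not members:
            new_topics.append(list(old))  # keep an empty topic unchanged
            continue
        mean = [sum(m[i] for m in members) / len(members) for i in range(len(old))]
        new_topics.append(normalize(mean))
    return new_topics


def hard_em(groups, topics, iterations=10):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = assign(groups, topics)
    for _ in range(iterations):
        topics = update(groups, labels, topics)
        new_labels = assign(groups, topics)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, topics


# Toy 2-D "embeddings" for four sentence groups and two initial topic vectors.
toy_groups = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
toy_labels, toy_topics = hard_em(toy_groups, [[1.0, 0.0], [0.0, 1.0]])
```

Because every sentence group contributes to exactly one topic, each iteration only needs one similarity lookup per group and topic, which is one plausible reason hard assignments keep inference fast.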

Paper Structure (13 sections, 8 equations, 1 figure, 10 tables, 2 algorithms)


Figures (1)

  • Figure 1: Overview of training data generation for fine-tuning, assuming a corpus $D$ of two documents and three distinct topics and using single sentences. For the sentence "This hockey season...", we sample sentences assumed to be in the same topic and sentences assumed to be in distinct topics; wrong samples are removed based on a similarity computation using a non-fine-tuned LLM.