Topic Modeling as Long-Form Generation: Can Long-Context LLMs revolutionize NTM via Zero-Shot Prompting?
Xuan Xu, Haolun Li, Zhongliang Yang, Beilin Chu, Jia Song, Moxuan Xu, Linna Zhou
TL;DR
This work reframes topic modeling (TM) as a long-form generation task for large language models (LLMs) and demonstrates a practical, out-of-the-box pipeline that summarizes topics into topic cards (each with a summary, keywords, and representative titles) and assigns documents to topics via keyword matching. It defines an entropy-based analysis of topic and keyword distributions to compare LLM-driven outputs with traditional neural topic models (NTMs), and conducts a systematic zero-shot comparison against several NTMs on the NYT corpus. The findings show that zero-shot LLMs achieve higher topic diversity and human-like interpretability, with models like Claude Sonnet4 leading on many metrics, while some NTMs retain higher assignment accuracy due to lexical overlap. Overall, the results challenge the notion that most NTMs are outdated and highlight the practical advantages of LLM-based topic modeling, including ease of use and multimodal capabilities. The work provides a concrete, scalable paradigm for integrating TM into modern LLM workflows, enabling richer supervision and analysis beyond word-list coherence.
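To make the entropy-based analysis concrete, the following is a minimal sketch that computes plain Shannon entropy over the empirical distribution of topic assignments. The exact formulation used in the paper is not given in this summary, so the function names `shannon_entropy` and `normalized_entropy`, and the choice of base-2 logarithms, are illustrative assumptions rather than the authors' definition.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution over labels.

    Assumption: `labels` holds one topic label per assigned document.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # sum of p * log2(1/p), written with total/c so every term is non-negative
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def normalized_entropy(labels):
    """Entropy divided by its maximum, log2(number of distinct labels)."""
    k = len(set(labels))
    if k <= 1:
        return 0.0
    return shannon_entropy(labels) / math.log2(k)
```

Under this sketch, a value near 1 from `normalized_entropy` would indicate that documents are spread evenly across topics, while a value near 0 would indicate that a few topics absorb most documents. The same functions could be applied to keyword occurrences instead of topic labels.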
Abstract
Traditional topic models such as neural topic models (NTMs) rely on inference and generation networks to learn latent topic distributions. This paper explores a new paradigm for topic modeling (TM) in the era of large language models (LLMs), framing TM as a long-form generation task and updating its definition accordingly. We propose a simple but practical approach that implements LLM-based topic modeling out of the box in three steps: sample a data subset, generate topics and representative text with our prompt, and assign texts to topics by keyword matching. We then investigate whether the long-form generation paradigm can beat NTMs via zero-shot prompting. We conduct a systematic comparison between NTMs and LLMs in terms of topic quality and empirically examine the claim that "a majority of NTMs are outdated."
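The keyword-matching assignment step can be illustrated with a small sketch. It assumes the topic cards have already been generated by an LLM; the `TopicCard` structure, the `tokenize` helper, and the rule of picking the topic with the most keyword hits are hypothetical choices made for illustration, not the paper's implementation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TopicCard:
    """Hypothetical container for one LLM-generated topic."""
    name: str
    summary: str
    keywords: list
    titles: list = field(default_factory=list)


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def assign(document, cards):
    """Return the name of the card with the most keyword hits, or None.

    A keyword counts as a hit when all of its tokens occur in the document.
    Ties are resolved in favor of the earlier card (an arbitrary assumption).
    """
    tokens = tokenize(document)
    best_name, best_hits = None, 0
    for card in cards:
        hits = 0
        for keyword in card.keywords:
            kw_tokens = tokenize(keyword)
            if kw_tokens and kw_tokens <= tokens:
                hits += 1
        if hits > best_hits:
            best_name, best_hits = card.name, hits
    return best_name


cards = [
    TopicCard("politics", "Elections and legislation.", ["election", "senate"]),
    TopicCard("sports", "Professional sports coverage.", ["baseball", "playoffs"]),
]
label = assign("The Senate delayed the election bill.", cards)
```

Because this rule depends on exact token overlap, it would favor documents that reuse a topic's surface vocabulary, which is one plausible reading of why lexical overlap matters for assignment accuracy.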
