Concept Navigation and Classification via Open-Source Large Language Model Processing
Maël Kubli
TL;DR
The paper tackles extracting latent constructs such as frames, narratives, and topics from large text corpora. It advances a hybrid framework that couples open-source LLMs with structured human-in-the-loop validation to improve interpretability and robustness over traditional NLP methods like LDA and BERTopic. Demonstrations across European Parliament AI debates, US encryption coverage, and the 20 Newsgroups dataset show strong frame-detection performance and competitive topic classification, with intercoder reliability benchmarks providing a human baseline. The work highlights scalable, domain-adaptable analysis while noting that fully automated precision remains elusive, and it outlines future directions in domain-specific fine-tuning and active learning, with attention to ethics and transparency.
Abstract
This paper presents a novel methodological framework for detecting and classifying latent constructs, including frames, narratives, and topics, from textual data using open-source large language models (LLMs). The proposed hybrid approach combines automated summarization with human-in-the-loop validation to enhance the accuracy and interpretability of construct identification. By employing iterative sampling coupled with expert refinement, the framework is designed to strengthen methodological robustness and conceptual precision. Applied to diverse datasets, including AI policy debates, newspaper articles on encryption, and the 20 Newsgroups dataset, the approach demonstrates its versatility in systematically analyzing complex political discourse, media framing, and topic classification tasks.
