Concept Navigation and Classification via Open-Source Large Language Model Processing

Maël Kubli

TL;DR

The paper addresses the extraction of latent constructs such as frames, narratives, and topics from large text corpora. It proposes a hybrid framework that couples open-source LLMs with structured human-in-the-loop validation, aiming to improve interpretability and robustness over traditional NLP methods such as LDA and BERTopic. Demonstrations on European Parliament AI debates, US newspaper coverage of encryption, and the 20 Newsgroups dataset show strong frame-detection performance and competitive topic classification, with intercoder reliability benchmarks providing a human baseline. The work highlights scalable, domain-adaptable analysis while noting that fully automated precision remains elusive, and it outlines future directions in domain-specific fine-tuning and active learning, with attention to ethics and transparency.

Abstract

This paper presents a novel methodological framework for detecting and classifying latent constructs, including frames, narratives, and topics, from textual data using Open-Source Large Language Models (LLMs). The proposed hybrid approach combines automated summarization with human-in-the-loop validation to enhance the accuracy and interpretability of construct identification. By employing iterative sampling coupled with expert refinement, the framework guarantees methodological robustness and ensures conceptual precision. Applied to diverse data sets, including AI policy debates, newspaper articles on encryption, and the 20 Newsgroups data set, this approach demonstrates its versatility in systematically analyzing complex political discourses, media framing, and topic classification tasks.
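
The abstract describes the workflow only at a high level: documents are sampled iteratively, summarized automatically, and the resulting candidate constructs are refined by an expert. The following is a minimal sketch of such a loop, not the paper's implementation. The function `discover_classes`, its parameters, the callables `summarize`, `propose_label`, and `review`, and the stopping rule (stop at the first round that yields no new candidates) are all hypothetical and introduced here for illustration only. In practice `summarize` and `propose_label` would wrap calls to an open-source LLM, and `review` would be a human decision step.

```python
import random
from typing import Callable


def discover_classes(
    documents: list[str],
    summarize: Callable[[str], str],        # hypothetical: LLM summarization step
    propose_label: Callable[[str], str],    # hypothetical: maps a summary to a candidate construct
    review: Callable[[set[str], set[str]], set[str]],  # hypothetical: human validation step
    sample_size: int = 5,
    max_rounds: int = 10,
    seed: int = 0,
) -> set[str]:
    """Iteratively sample documents and grow a human-validated set of classes."""
    rng = random.Random(seed)
    classes: set[str] = set()
    for _ in range(max_rounds):
        # Draw a random batch without replacement within the round.
        batch = rng.sample(documents, min(sample_size, len(documents)))
        # Summarize each document, then derive one candidate label per summary.
        candidates = {propose_label(summarize(doc)) for doc in batch}
        new = candidates - classes
        if not new:
            # Assumed stopping rule: a round with no new candidates ends the loop.
            break
        # The reviewer sees the accepted set and the new candidates,
        # and returns the updated accepted set.
        classes = review(classes, new)
    return classes


# Toy stand-ins for the LLM and the human reviewer (illustrative only).
docs = [
    "encryption protects privacy",
    "encryption helps criminals",
    "ai threatens jobs",
    "ai boosts growth",
]
found = discover_classes(
    docs,
    summarize=lambda d: d,
    propose_label=lambda s: "security" if "encryption" in s else "economy",
    review=lambda accepted, new: accepted | new,
    sample_size=len(docs),
)
print(sorted(found))
```

Keeping the reviewer as an explicit callable makes the human step a visible part of the control flow rather than a post-hoc check, which is the property the hybrid approach emphasizes.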

Paper Structure

This paper contains 40 sections, 3 equations, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Construct Class Generation Framework