Automating the Information Extraction from Semi-Structured Interview Transcripts

Angelina Parfenova

TL;DR

This work tackles the challenge of automating information extraction from semi-structured interview transcripts to alleviate the labor-intensive qualitative coding process. It systematically compares topic-modeling approaches and identifies a BERT-based embeddings pipeline with UMAP for dimensionality reduction and HDBSCAN clustering as the most effective, culminating in a user-friendly prototype for researchers without programming skills. The study finds that transformer-based representations provide clearer, more diverse topics than traditional LDA in semi-structured transcripts, and emphasizes interactive topic-network visualization to reveal inter-topic connections. Practically, the tool enables faster, scalable qualitative analysis across domains such as marketing, social science, and healthcare, while acknowledging that automation complements rather than replaces expert interpretation.

Abstract

This paper explores the development and application of an automated system for extracting information from semi-structured interview transcripts. Because traditional qualitative analysis methods such as coding are labor-intensive, there is significant demand for tools that facilitate the analysis process. Our research investigates various topic modeling techniques and concludes that the best model for analyzing interview texts is a combination of BERT embeddings and HDBSCAN clustering. We present a user-friendly software prototype that enables researchers, including those without programming skills, to efficiently process and visualize the thematic structure of interview data. The tool not only facilitates the initial stages of qualitative analysis but also offers insight into how the revealed topics are interconnected, thereby deepening the analysis.
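The abstract does not say how the prototype derives the connections between topics. As one hedged illustration, a topic network could be built by counting how often two topics appear in the same interview. The function name `topic_cooccurrence`, the input layout, and the sample labels below are all assumptions made for this sketch, not the paper's method; the only borrowed convention is that HDBSCAN marks unclustered points with the label -1.

```python
from collections import Counter
from itertools import combinations

NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it leaves unclustered


def topic_cooccurrence(interview_topics):
    """Count, for each pair of topics, the interviews in which both occur.

    interview_topics: one list of topic labels per interview
    (hypothetical layout: one label per transcript segment).
    Returns a Counter mapping (topic_a, topic_b) to an edge weight.
    """
    edges = Counter()
    for labels in interview_topics:
        # Each topic counts once per interview; noise segments are ignored.
        present = sorted(set(labels) - {NOISE_LABEL})
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


# Hypothetical segment labels for three interviews, for illustration only.
interview_topics = [
    [0, 2, 2, 1],
    [1, 2],
    [0, 1, -1],
]
edges = topic_cooccurrence(interview_topics)
for (a, b), weight in sorted(edges.items()):
    print(f"topic {a} -- topic {b}: {weight}")
```

The resulting weighted edge list could be handed to any graph-drawing library; thresholding low weights is a common way to keep such a network readable.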

Paper Structure

This paper contains 17 sections, 5 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: The coding process visualized
  • Figure 2: Comparison of model performance (Size of the bubble is Topic Diversity)
  • Figure 3: Output of the model on the set of interviews [parfenova2023regulatory]
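
Figure 2 encodes Topic Diversity as bubble size, but this listing does not define the metric. A commonly used definition is the fraction of unique words among the top-k words of all topics, where 1.0 means no two topics share a top word. Whether the paper uses exactly this definition is an assumption; the function name `topic_diversity`, the `top_k` parameter, and the sample topics below are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    topics: list of word lists, each ordered from most to least important.
    Returns a value in (0, 1]; 1.0 means the topics share no top words.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, for illustration only.
topics = [
    ["price", "discount", "store", "brand"],
    ["doctor", "clinic", "appointment", "price"],
    ["family", "children", "school", "store"],
]
diversity = topic_diversity(topics, top_k=4)
print(f"Topic diversity: {diversity:.3f}")
```

Here "price" and "store" each appear in two topics, so 10 of the 12 top words are unique.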