RACER: An LLM-powered Methodology for Scalable Analysis of Semi-structured Mental Health Interviews

Satpreet Harcharan Singh, Kevin Jiang, Kanchan Bhasin, Ashutosh Sabharwal, Nidal Moukaddam, Ankit B Patel

TL;DR

This work presents RACER, an expert-guided LLM-based pipeline that converts semi-structured interview transcripts into thematically clustered insights at scale. By retrieving, aggregating, clustering with expert guidance, and re-clustering responses, RACER reaches 72% agreement with two human evaluators, approaching the human inter-rater agreement of 77%, and provides a scalable approach to analyzing COVID-19-related mental health impacts among 93 healthcare professionals and trainees. The study identifies both the potential and the limitations of LLM-assisted qualitative analysis, notably that nuanced emotions pose challenges for both humans and machines, and proposes confidence measures to flag ambiguous content. Overall, RACER demonstrates a practical pathway to accelerate qualitative healthcare research while highlighting the enduring need for human expertise in interpreting complex emotional narratives.

Abstract

Semi-structured interviews (SSIs) are a commonly employed data-collection method in healthcare research, offering in-depth qualitative insights into subject experiences. Despite their value, the manual analysis of SSIs is notoriously time-consuming and labor-intensive, in part due to the difficulty of extracting and categorizing emotional responses, and challenges in scaling human evaluation for large populations. In this study, we develop RACER, a Large Language Model (LLM) based expert-guided automated pipeline that efficiently converts raw interview transcripts into insightful domain-relevant themes and sub-themes. We used RACER to analyze SSIs conducted with 93 healthcare professionals and trainees to assess the broad personal and professional mental health impacts of the COVID-19 crisis. RACER achieves moderately high agreement with two human evaluators (72%), which approaches the human inter-rater agreement (77%). Interestingly, LLMs and humans struggle with similar content involving nuanced emotional, ambivalent/dialectical, and psychological statements. Our study highlights the opportunities and challenges in using LLMs to improve research efficiency and opens new avenues for scalable analysis of SSIs in healthcare research.

Paper Structure

This paper contains 26 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Stages of the RACER (Retrieve, Aggregate, Cluster with Expert guidance, and Re-cluster) Semi-Structured Interview (SSI) processing pipeline: First, Retrieve relevant responses to each SSI question. Aggregate responses across subjects before Clustering them into themes (and subthemes) defined by Experts. To assess robustness, Re-cluster multiple times and make assignments by majority vote. The pipeline efficiently converts SSI text into meaningful themes with confidence scores.
  • Figure 2: Human-RACER disagreement resembles human-human disagreement: (A) Transcript segments from two different subjects asked “How the [COVID-19] crisis has affected you emotionally” that were either all concordant or all non-concordant between evaluators, illustrating the ambiguity inherent in parsing free responses. (B) The concordance ratio calculated between evaluator pairs and between all evaluators. A chi-squared test with Yates' continuity correction across the three evaluator pairings showed that human evaluator concordance did not differ from evaluator one’s concordance with RACER. * p $<$ 0.05, ** p $<$ 0.01, *** p $<$ 0.001, **** p $<$ 0.0001.
  • Figure 3: RACER “confidence” correlates with evaluator concordance and reveals areas of human disagreement: (A) Distribution of the proportion of subject-question pairs that were clustered robustly, among the subject-question pairs examined by human evaluators (20 subjects) or among all subject-question pairs (93 subjects). (B) Average RACER confidence scores for all subjects (n = 93) for a given question correlate significantly with the evaluator pair concordance (n = 20), using Spearman rank correlation. (C) Average RACER confidence scores calculated within concordant vs non-concordant subject-question pairs between evaluators. A chi-squared test was conducted to determine whether the distribution of confidence scores differed between concordant and non-concordant subject-question pairs. * p $<$ 0.05, ** p $<$ 0.01, *** p $<$ 0.001, **** p $<$ 0.0001.
  • Figure 4: Aggregated interview responses to selected questions about safety concerns arising from COVID-19 exposure, work impact, and medical management decisions. Error bars reflect cluster-assignment variability arising from the re-clustering step in RACER.
  • Figure 5: Aggregated interview responses to selected questions about emotional and psychological impact, and support and coping strategies. Error bars reflect cluster-assignment variability arising from the re-clustering step in RACER.
  • ...and 3 more figures
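
The Figure 1 caption describes re-clustering multiple times, assigning themes by majority vote, and reporting confidence scores. A minimal Python sketch of that step might look like the following. The function name `majority_vote`, the example theme labels, and the definition of confidence as the share of runs agreeing with the winning theme are all assumptions made for illustration; the paper's exact scoring may differ.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, confidence) for one subject-question pair.

    `labels` holds the theme assigned in each re-clustering run.
    Confidence is taken here as the fraction of runs that agree with
    the winning theme (an assumed definition, not the paper's).
    """
    if not labels:
        raise ValueError("need at least one clustering run")
    counts = Counter(labels)
    # most_common(1) returns the label with the highest count; on a tie,
    # the label encountered first in `labels` wins.
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)


# Hypothetical theme labels from five re-clustering runs.
runs = ["anxiety", "anxiety", "burnout", "anxiety", "anxiety"]
label, confidence = majority_vote(runs)
print(label, confidence)  # anxiety 0.8
```

Under this assumed definition, a low confidence value marks a subject-question pair whose theme assignment changes between runs, which is the kind of ambiguous content the summary suggests flagging for human review.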