Exploring the Human-LLM Synergy in Advancing Theory-driven Qualitative Analysis

Han Meng, Yitian Yang, Wayne Fu, Jungup Lee, Yunan Li, Yi-Chieh Lee

TL;DR

The paper introduces CHALET, a collaborative human-LLM framework designed to advance theory-driven qualitative analysis by combining iterative human deductive coding with LLM-assisted coding, disagreement analysis, and inductive code development. Through a mental-illness stigma case study grounded in the attribution model, the authors demonstrate that carefully engineered prompts and codebook components can yield high human-LLM agreement, while disagreements surface opportunities to refine theories and generate new codes. The work emphasizes that human and AI agency are co-constitutive in qualitative analysis, advocating for disagreement as a productive driver of theoretical discovery and methodological innovation. It also discusses cross-cultural, ethical, and practical considerations, offering guidance on applying CHALET to diverse domains and on shaping LLM-integrated qualitative-coding tools for richer, theory-grounded insights.

Abstract

Qualitative coding is a demanding yet crucial research method in the field of Human-Computer Interaction (HCI). While recent studies have shown the capability of large language models (LLMs) to perform qualitative coding within theoretical frameworks, their potential for collaborative human-LLM discovery and generation of new insights beyond initial theory remains underexplored. To bridge this gap, we proposed CHALET, a novel approach that harnesses the power of human-LLM partnership to advance theory-driven qualitative analysis by facilitating iterative coding, disagreement analysis, and conceptualization of qualitative data. We demonstrated CHALET's utility by applying it to the qualitative analysis of conversations related to mental-illness stigma, using the attribution model as the theoretical framework. Results highlighted the unique contribution of human-LLM collaboration in uncovering latent themes of stigma across the cognitive, emotional, and behavioral dimensions. We discuss the methodological implications of the human-LLM collaborative approach to theory-based qualitative analysis for the HCI community and beyond.

Paper Structure (57 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Overview of the CHALET approach. In this work, we propose this approach to advance theory-driven qualitative analysis.
  • Figure 2: The CHALET systematic approach: a case study on mental-illness stigma. I. Data Collection: (a) Theoretical Framework. The attribution model [attribution_model_corrigan_2003] is selected as the theoretical underpinning. This model covers concepts of responsibility, dangerousness, controllability (i.e., cognitive judgments), fear, pity, anger (i.e., emotional responses), avoidance, helping, and coercive segregation (i.e., behavioral responses). (b) Human-chatbot Conversation. Qualitative data is collected interactively from participants through chatbot-posed questions that align with the theoretical model. For example, participants are queried about their willingness to assist people with mental illness (corresponding to the helping code). II. Human-LLM Synergistic Deductive Coding. Human coders use the attribution model to develop a codebook and code the response as Stigmatizing. This codebook is then learned by LLMs (T1), which, in contrast, code the same response as Non-stigmatizing. III. Collaborative Inductive Coding. Disagreements between human coders and LLMs (T2) inform further qualitative analysis, revealing a perceived CEO-employee relationship and an underlying assumption of a hierarchical social structure. This is hypothesized to indicate a sense of superiority, resulting in the final coding of Stigmatizing (patronization) (T3).
  • Figure 3: Conversation flow design. The conversation comprises three main components: small talk, vignette delivery, and question answering. In the question-answering phase, follow-up questions are used to gather additional information [follow_up_q_han_2021]; active-listening skills are implemented to engage participants [active_chatbot_xiao_2020]; neutral self-disclosure is applied to facilitate mutual disclosure [disclosure_lee_2022] while maintaining neutrality.
  • Figure 4: Distribution of human-assigned codes across seven attributions, where each message was coded as Stigmatizing, Non-stigmatizing, or Stigmatizing (others). Codes were assigned through iterative deductive coding with inter-rater reliability checks (Cohen's $\kappa$ > 0.6), following the procedures described in Section \ref{sec:humandeductive}. See Supplementary Materials for the complete codebook.
  • Figure 5: Cohen's $\kappa$ between LLM-generated and human-assigned codes under different prompts. The $x$-axis shows 23 prompts with different codebook-component combinations. For brevity, we assign the aliases $L1$-$L23$ to these prompts from left to right. Panels (a)-(g) show the LLM-human coding consistency for each attribution using prompts with varying codebook components. Panel (h) displays the aggregated Cohen's $\kappa$ across all attributions. All-code info refers to prompts that provide the same codebook components for every code, while target-code info indicates that the prompt provides codebook components only for the corresponding attribution, with other attributions coded collectively as Stigmatizing (others). name: code name only; +vig: added vignette; +rule: added rules; +keyword: added keywords; +exp: added example. CoT: chain-of-thought; NoCoT: no chain-of-thought. S: Stigmatizing; NS: Non-stigmatizing; O: Stigmatizing (others). S_NS_O denotes the order of the three examples as Stigmatizing, Non-stigmatizing, and Stigmatizing (others); the other notations indicate the remaining permutations in the same way. Error bars indicate 95% confidence intervals for Cohen's $\kappa$ estimates across varying prompt conditions. The result highlighted in red is the one whose components fully match the human-written codebook, albeit in a different format. Prompt configurations were systematically varied to test the effects of codebook components on LLM-human agreement; detailed prompt templates and design rationale are provided in Supplementary Materials.
  • ...and 1 more figure
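
The captions for Figures 4 and 5 rely on Cohen's $\kappa$ to quantify agreement between human-assigned and LLM-generated codes, and the approach treats the remaining disagreements as input for further analysis. The following is a minimal sketch of both steps, not the authors' implementation. The function names `cohens_kappa` and `disagreements`, the short labels "S", "NS", and "O", and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(human, llm):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: share of items that both coders labelled identically.
    p_o = sum(h == m for h, m in zip(human, llm)) / n
    # Chance agreement, estimated from each coder's marginal label frequencies.
    h_counts, m_counts = Counter(human), Counter(llm)
    p_e = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used a single, identical label throughout
    return (p_o - p_e) / (1 - p_e)


def disagreements(messages, human, llm):
    """Return (message, human_code, llm_code) for every item coded differently."""
    return [(msg, h, m) for msg, h, m in zip(messages, human, llm) if h != m]


# Hypothetical toy data: S = Stigmatizing, NS = Non-stigmatizing,
# O = Stigmatizing (others).
messages = ["m1", "m2", "m3", "m4", "m5", "m6"]
human_codes = ["S", "S", "NS", "NS", "O", "S"]
llm_codes = ["S", "NS", "NS", "NS", "O", "S"]

kappa = cohens_kappa(human_codes, llm_codes)
to_review = disagreements(messages, human_codes, llm_codes)
print(f"kappa = {kappa:.3f}")
print(to_review)
```

In this toy data the coders agree on five of six messages, and the single mismatch ("m2") is the kind of item that would be passed on to the disagreement-analysis step. The sketch does not compute the 95% confidence intervals shown in Figure 5; those would need an additional step such as bootstrapping, which is omitted here.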