LLMCode: Evaluating and Enhancing Researcher-AI Alignment in Qualitative Analysis

Joel Oksanen, Andrés Lucero, Perttu Hämäläinen

TL;DR

This paper tackles the challenge of aligning large language models with the nuanced, reflexive insights central to research for design (RfD). It introduces LLMCode, an open-source toolkit that uses two metrics, Intersection over Union ($IoU$) and Modified Hausdorff Distance ($MHD$), to quantify how closely AI-generated coding matches human coding and how semantically aligned the codes are. Across two studies with 26 designers, the authors show LLMs can match deductive coding patterns but struggle to emulate deeper interpretive reasoning, highlighting the need for ongoing human oversight and iterative collaboration. The work advances the field by providing a concrete evaluation framework and an interactive interface that helps researchers manage AI-assisted qualitative coding while preserving interpretive depth, thereby informing the design of more trustworthy researcher-AI tools in qualitative inquiry.

Abstract

The use of large language models (LLMs) in qualitative analysis offers enhanced efficiency but raises questions about their alignment with the contextual nature of research for design (RfD). This research examines the trustworthiness of LLM-driven design insights, using qualitative coding as a case study to explore the interpretive processes central to RfD. We introduce LLMCode, an open-source tool integrating two metrics, namely Intersection over Union (IoU) and Modified Hausdorff Distance (MHD), to assess the alignment between human and LLM-generated insights. Across two studies involving 26 designers, we find that while the model performs well with deductive coding, its ability to emulate a designer's deeper interpretive lens over the data is limited, emphasising the importance of human-AI collaboration. Our results highlight a reciprocal dynamic where users refine LLM outputs and adapt their own perspectives based on the model's suggestions. These findings underscore the importance of fostering appropriate reliance on LLMs by designing tools that preserve interpretive depth while facilitating intuitive collaboration between designers and AI.

Paper Structure

This paper contains 32 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Mean IoU plotted against the number of few-shot examples included in the coding prompt (Study 1, N=11). The metric increases as the number of few-shot examples increases, indicating improved alignment between the LLM and human annotations.
  • Figure 2: Mean MHD plotted against the number of few-shot examples included in the coding prompt (Study 1, N=11). The metric decreases as the number of few-shot examples increases, indicating improved alignment between the LLM- and human-annotated codes.
  • Figure 3: New codes as a fraction of all annotated codes in the given time frame (Study 1, N=15). The rate at which new codes are discovered decreases over time as the codebook becomes saturated.
  • Figure 4: Scatter plot showing the correlation (Pearson coefficient = 0.53) between the similarity of human-annotated codes to the example set and the model’s performance (Study 1, N=8). Higher dissimilarity generally leads to lower alignment, though some cases show the model successfully extrapolating beyond the examples. A line of equality $y = x$ is added to illustrate how the model rarely performs worse than what is expected based on its examples' similarity to the texts' underlying codes.
  • Figure 5: The development of (a) IoU and (b) MHD as participants in Study 2 manually iterated on their few-shot example sets (human), compared to a random sampling baseline with equal numbers of positive and negative examples (random). For visualisation purposes, the curves for individual participants are adjusted to match the initial mean.
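
To make the direction of the two metrics in the figures above concrete (higher IoU and lower MHD both indicate better alignment), here is a minimal Python sketch of their textbook definitions. It is not LLMCode's implementation: the choice of character indices as the units for IoU, of plain coordinate tuples as stand-ins for code embeddings, and of Euclidean distance, as well as the function names `iou` and `modified_hausdorff`, are all assumptions made for illustration.

```python
import math


def iou(human: set, model: set) -> float:
    """Intersection over Union of two sets of highlighted units.

    Assumption: each set holds the character indices a coder highlighted.
    """
    union = human | model
    if not union:
        # Assumption: two empty annotations count as perfect agreement.
        return 1.0
    return len(human & model) / len(union)


def _directed_mean_min(a, b) -> float:
    """Mean, over points in a, of the distance to the nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def modified_hausdorff(a, b) -> float:
    """Modified Hausdorff Distance: the larger of the two directed
    mean-of-minimum distances between point sets a and b.

    Assumption: points are equal-length tuples of floats standing in
    for code embeddings, compared with Euclidean distance.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(_directed_mean_min(a, b), _directed_mean_min(b, a))


# Hypothetical data: two coders highlight overlapping character ranges.
human_span = set(range(0, 10))
model_span = set(range(5, 15))
print(iou(human_span, model_span))

# Hypothetical 2-D "embeddings" of human and model code sets.
human_codes = [(0.0, 0.0), (1.0, 0.0)]
model_codes = [(0.0, 0.0), (1.0, 1.0)]
print(modified_hausdorff(human_codes, model_codes))
```

Identical inputs give an IoU of 1 and an MHD of 0, which is why improving alignment appears as a rising curve for IoU and a falling curve for MHD.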