Decoding Complexity: Exploring Human-AI Concordance in Qualitative Coding

Elisabeth Kirsten, Annalina Buckmann, Abraham Mhaidli, Steffen Becker

TL;DR

This study addresses the problem of time-consuming qualitative coding by evaluating how large language models (LLMs) perform on qualitative data analysis (QDA) tasks of varying complexity. Using GPT-3.5 and GPT-4 with zero-shot, one-shot, and few-shot prompts, the authors compare model outputs to human coding on three real-world interview datasets spanning semantic and latent coding. Results show that GPT-4 consistently aligns more closely with human coders than GPT-3.5: Task A reaches near-perfect agreement, while Task C proves more challenging. Few-shot prompts help GPT-3.5 reduce errors but do not guarantee overall improvements. The findings highlight task-specific considerations, prompt design choices, and ethical and methodological implications for integrating LLMs into qualitative research.

Abstract

Qualitative data analysis provides insight into the underlying perceptions and experiences within unstructured data. However, the time-consuming nature of the coding process, especially for larger datasets, calls for innovative approaches, such as the integration of Large Language Models (LLMs). This short paper presents initial findings from a study investigating the integration of LLMs for coding tasks of varying complexity in a real-world dataset. Our results highlight the challenges inherent in coding with extensive codebooks and contexts, both for human coders and LLMs, and suggest that the integration of LLMs into the coding process requires a task-by-task evaluation. We examine factors influencing the complexity of coding tasks and initiate a discussion on the usefulness and limitations of incorporating LLMs in qualitative research.
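
As an illustration of how human-AI concordance on a coding task could be quantified, the following is a minimal sketch that computes Cohen's kappa between a human coder's labels and a model's labels for the same segments. The choice of Cohen's kappa is an assumption made for this example; the text above does not state which agreement measure the study uses. The function name `cohens_kappa` and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], model: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two coders over the same segments.

    Assumption: one code per segment per coder (single-label coding).
    """
    if len(human) != len(model) or not human:
        raise ValueError("Both coders must label the same non-empty set of segments.")

    n = len(human)
    # Observed agreement: share of segments where both coders chose the same code.
    observed = sum(h == m for h, m in zip(human, model)) / n

    # Expected agreement by chance, from each coder's marginal code frequencies.
    human_counts = Counter(human)
    model_counts = Counter(model)
    expected = sum(human_counts[c] * model_counts[c] for c in human_counts) / (n * n)

    if expected == 1.0:
        # Both coders used a single identical code throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four interview segments.
human_labels = ["device", "other", "device", "other"]
model_labels = ["device", "other", "other", "other"]
kappa = cohens_kappa(human_labels, model_labels)
```

Computing such a score separately for each task and each prompt variant is one way to carry out the task-by-task evaluation the abstract calls for.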

Paper Structure

This paper contains 12 sections, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Prompt template for coding Internet-connected devices with optional examples. We modify the highlighted text sections to change the coding task.
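
The caption for Figure 1 describes a prompt template with optional examples and task-specific sections that are swapped out. A minimal sketch of that idea follows. The template wording, the `build_prompt` function, and the sample codebook are all hypothetical and are not the authors' actual prompt; only the structure (a fixed frame, replaceable task text, and zero or more examples) follows the caption.

```python
from typing import Sequence, Tuple

# Hypothetical template text; the paper's real wording is not reproduced here.
TEMPLATE = (
    "You are assisting with qualitative coding of interview data.\n"
    "Task: {task}\n"
    "Codebook: {codes}\n"
    "{examples}"
    "Segment: {segment}\n"
    "Code:"
)


def build_prompt(
    task: str,
    codes: Sequence[str],
    segment: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Fill the template; zero, one, or several examples give zero-, one-, or few-shot prompts."""
    example_text = "".join(
        f"Example segment: {text}\nExample code: {code}\n" for text, code in examples
    )
    return TEMPLATE.format(
        task=task,
        codes=", ".join(codes),
        examples=example_text,
        segment=segment,
    )


# Hypothetical task description and codebook.
task = "Identify the Internet-connected device mentioned in the segment."
codes = ["smart speaker", "smart TV", "none"]

zero_shot = build_prompt(task, codes, "I mostly use my phone.")
few_shot = build_prompt(
    task,
    codes,
    "I mostly use my phone.",
    examples=[
        ("We ask the speaker for the weather.", "smart speaker"),
        ("The TV streams everything now.", "smart TV"),
    ],
)
```

Changing only `task` and `codes` while keeping the frame fixed mirrors the caption's note that highlighted sections are modified to change the coding task.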