Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis
Angelina Parfenova, Andreas Marfurt, Alexander Denzler, Juergen Pfeffer
TL;DR
This study addresses the automation of inductive open coding in qualitative data analysis by benchmarking six open-source LLMs against human experts. It employs a two-stage evaluation on a dataset enhanced with SemEval data, using ROUGE and BERTScore to measure open-coding similarity and a two-stage human rating process to assess label quality relative to a gold standard. The key findings are that LLMs excel on simpler sentences, while humans outperform them on complex ones; in addition, some LLMs align more closely with the gold standard yet receive lower expert scores. The work demonstrates the potential of LLMs to automate or assist open coding in practical, data-light contexts and lays the groundwork for integrating LLMs into a full thematic analysis pipeline, with axial coding left to future research.
Abstract
This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research examines the inductive process, in which labels emerge from the data. The study evaluates the performance of six open-source LLMs compared to human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human and LLM-generated labels by comparing them to the gold standard from the test set. While human annotations may sometimes differ from the gold standard, they are often rated more favorably by other humans. In contrast, some LLMs demonstrate closer alignment with the true labels but receive lower evaluations from experts.
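To illustrate what comparing a generated label to a gold-standard label can look like, the following is a minimal sketch of a unigram-overlap F1 score in the spirit of ROUGE-1. It is not the authors' evaluation code: the function name `rouge1_f1`, the whitespace tokenization, and the example labels are all assumptions made for illustration, and a real evaluation would more likely use an established ROUGE implementation together with BERTScore.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate label and a reference label.

    Simplified, hypothetical stand-in for ROUGE-1: lowercases both strings
    and splits on whitespace, with no stemming or stopword handling.
    """
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, not taken from the paper's dataset.
gold_label = "lack of time"
model_label = "time pressure"
score = rouge1_f1(model_label, gold_label)
```

A score like this rewards only exact word overlap, so two labels that express the same idea in different words score low. That limitation is one reason to pair lexical metrics with embedding-based ones and with human ratings.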
