Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis

Angelina Parfenova, Andreas Marfurt, Alexander Denzler, Juergen Pfeffer

TL;DR

This study tackles automating inductive open coding in qualitative data analysis by benchmarking six open-source LLMs against human experts. It evaluates on a dataset enhanced with SemEval data, using ROUGE and BERTScore to measure open-coding similarity and a two-stage human rating process to assess label quality relative to a golden standard. Key findings reveal that LLMs excel on simpler sentences while humans outperform them on complex ones, and that some LLMs align more closely with the golden standard yet receive lower expert scores. The work demonstrates the potential of LLMs to automate or assist open coding in practical, data-light contexts and lays the groundwork for integrating LLMs into a full thematic analysis pipeline, with axial coding left to future research.

Abstract

This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research investigates the inductive process where labels emerge from the data. The study evaluates the performance of six open-source LLMs compared to human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human and LLM-generated labels by comparing them to the golden standard from the test set. While human annotations may sometimes differ from the golden standard, they are often rated more favorably by other humans. In contrast, some LLMs demonstrate closer alignment with the true labels but receive lower evaluations from experts.

Paper Structure

This paper contains 28 sections, 2 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Coding in thematic analysis. The source text is split into quotes. The main idea of a paragraph is extracted and becomes a code (open coding). Then, this list of codes is hierarchically grouped into more abstract categories (axial coding).
  • Figure 2: Dataset examples
  • Figure 3: BERT F1 score as dataset size increases, for all models. The shaded areas represent the standard deviation. The analysis shows how each model benefits from additional data, with some models like Mistral and Falcon displaying higher stability and faster performance gains compared to others. This figure illustrates that a few examples are enough for sufficient fine-tuning performance.
  • Figure 4: Mistral BERT F1 scores across different numbers of examples.
  • Figure 5: Comparison of Average Ratings and Deviation from Golden Standard (DGS) for LLMs and human coders. Panel (a) shows the average ratings given to the human coders (CoderA, CoderB, CoderC) and the LLMs, segmented by sentence difficulty (Easy, Medium, Difficult). The graph highlights that LLMs generally receive higher ratings than human coders on easy sentences, while humans excel at coding more complex sentences. Panel (b) presents the DGS results for both human coders and LLMs across different sentence difficulties, with positive and negative deviations from the golden standard.
  • ...and 1 more figures