Scalable Qualitative Coding with LLMs: Chain-of-Thought Reasoning Matches Human Performance in Some Hermeneutic Tasks

Zackary Okun Dunivin

TL;DR

This paper asks whether large language models can perform qualitative coding with human-equivalent accuracy on complex, paragraph-length texts. Using a W.E.B. Du Bois case study, in which a 9-code, 3-category codebook is adapted and applied to 232 NYT passages, it compares GPT-4 and GPT-3.5 under zero-shot and chain-of-thought (CoT) prompting. Measured against a human-derived gold standard, GPT-4 reaches human-equivalent intercoder reliability (Cohen's $κ\geq 0.79$ for 3 of 9 codes and $κ\geq 0.6$ for 8 of 9), and its agreement improves when it is prompted to justify its decisions and when each code is presented as an individual task. GPT-3.5 performs markedly worse on every code ($mean(κ) = 0.34$; $max(κ) = 0.55$). The work offers best practices for codebook design and prompting, covers practical considerations for adopting LLM-assisted content analysis at scale, and discusses what to expect from future models.

Abstract

Qualitative coding, or content analysis, extracts meaning from text to discern quantitative patterns across a corpus of texts. Recently, advances in the interpretive abilities of large language models (LLMs) offer potential for automating the coding process (applying category labels to texts), thereby enabling human researchers to concentrate on more creative research aspects, while delegating these interpretive tasks to AI. Our case study comprises a set of socio-historical codes on dense, paragraph-long passages representative of a humanistic study. We show that GPT-4 is capable of human-equivalent interpretations, whereas GPT-3.5 is not. Compared to our human-derived gold standard, GPT-4 delivers excellent intercoder reliability (Cohen's $κ\geq 0.79$) for 3 of 9 codes, and substantial reliability ($κ\geq 0.6$) for 8 of 9 codes. In contrast, GPT-3.5 greatly underperforms for all codes ($mean(κ) = 0.34$; $max(κ) = 0.55$). Importantly, we find that coding fidelity improves considerably when the LLM is prompted to give rationale justifying its coding decisions (chain-of-thought reasoning). We present these and other findings along with a set of best practices for adapting traditional codebooks for LLMs. Our results indicate that for certain codebooks, state-of-the-art LLMs are already adept at large-scale content analysis. Furthermore, they suggest the next generation of models will likely render AI coding a viable option for a majority of codebooks.
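Because the results above are reported as Cohen's $κ$ between the model and a human gold standard, a small worked example may help make the statistic concrete. The sketch below computes $κ$ for one binary code from two label lists. The function name `cohens_kappa` and the toy labels are illustrative assumptions, not the paper's data or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two raters' marginal
    # rates, summed over all labels either rater used.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (made up for illustration): 1 = code applies, 0 = code does not apply.
human = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
model = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
kappa = cohens_kappa(human, model)
```

In this toy case the raters agree on 8 of 10 passages, but chance agreement is 0.52, so $κ$ comes out near 0.58. That is why raw percent agreement overstates reliability when one label dominates.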

Paper Structure (17 sections, 2 figures, 3 tables)

This paper contains 17 sections, 2 figures, 3 tables.

Introduction
Automating Content Analysis: Past and Present
Case Study: W.E.B. Du Bois's Characterization in News Media
Results
Adapting a Codebook for an LLM
LLM-generated rationales are essential for evaluating performance.
LLMs require more precise descriptions than do human readers.
Prompting for machine-readable output.
Selecting a Model and Writing Prompts for Optimal Performance
GPT-4 greatly outperforms GPT-3.5.
Coding fidelity improves when codes are presented as individual tasks.
Coding fidelity improves when the model is prompted to justify its coding decisions.
Discussion
Determining appropriate domains for LLM-assisted qualitative coding.
Practical aspects of transitioning to content analysis with LLMs.
...and 2 more sections

Figures (2)

Figure 1: The chain-of-thought prompt sequence.
Figure 2: Two examples of prompt redefinition. Colored, alphabetically labeled blocks of text show alterations derived through iterative code refinement. Italics draw attention to direction to constrain interpretive scope to implicit or explicit information.
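The outline above names three prompting practices: presenting each code as an individual task, asking the model to justify its decision, and requesting machine-readable output. The sketch below shows one way those ideas might fit together. It is not the author's prompt sequence from Figure 1, which is not reproduced here. The function names, the prompt wording, and the JSON key `present` are all hypothetical, and no model API is called.

```python
import json


def build_code_prompts(code_name, code_definition, passage):
    """Return a two-step prompt sequence for ONE code (hypothetical wording)."""
    # Step 1: ask for a rationale before any decision (chain-of-thought style).
    rationale_prompt = (
        f"Code: {code_name}\n"
        f"Definition: {code_definition}\n\n"
        f"Passage:\n{passage}\n\n"
        "Explain, step by step, whether this code applies to the passage."
    )
    # Step 2: ask for a strict, machine-readable decision based on that rationale.
    decision_prompt = (
        "Based on your explanation, answer with JSON only, "
        'in the form {"present": true} or {"present": false}.'
    )
    return [rationale_prompt, decision_prompt]


def parse_decision(reply):
    """Parse the model's final reply; raise if it is not the expected JSON."""
    data = json.loads(reply)
    value = data["present"]
    if not isinstance(value, bool):
        raise ValueError("'present' must be a JSON boolean")
    return value


# Placeholder inputs; a real codebook definition and passage would go here.
prompts = build_code_prompts(
    code_name="example_code",
    code_definition="A placeholder definition standing in for a real codebook entry.",
    passage="A placeholder passage.",
)
decision = parse_decision('{"present": true}')
```

Keeping the rationale and the decision in separate turns means the explanation can be read on its own when auditing disagreements with human coders, while the strict JSON reply keeps downstream tallying simple.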
