Thematic Analysis with Open-Source Generative AI and Machine Learning: A New Method for Inductive Qualitative Codebook Development

Andrew Katz; Gabriella Coloyan Fleming; Joyce Main

TL;DR

The paper introduces the GATOS workflow, an open-source, multi-step approach to approximate inductive thematic analysis for qualitative data. Through three synthetic organizational datasets, it demonstrates how summaries, embeddings, clustering, and retrieval-augmented generation can yield codebooks and themes that closely align with ground-truth sub-themes. The study highlights the gains in scalability and reproducibility when using open-source large language models and embedding tools, while acknowledging limitations such as potential biases, redundancy, and the need for human oversight. Overall, GATOS offers a practical path toward scalable qualitative data analysis with transparent, auditable prompts and a rigorous evaluation against known ground truth.

Abstract

This paper addresses one central question: to what extent can open-source generative text models be used in a workflow to approximate thematic analysis in social science research? To answer it, we present the Generative AI-enabled Theme Organization and Structuring (GATOS) workflow, which uses open-source machine learning techniques, natural language processing tools, and generative text models to facilitate thematic analysis. To establish the validity of the method, we present three case studies that apply the GATOS workflow to inductively create codebooks similar to those produced by traditional thematic analysis procedures. Specifically, we investigate the extent to which a workflow comprising open-source models and tools can inductively produce codebooks that approach the known space of themes and sub-themes. To address the challenge of gleaning insights from large volumes of text, we combine open-source generative text models, retrieval-augmented generation, and prompt engineering to identify codes and themes, i.e., to generate a qualitative codebook. The process mimics the inductive coding that researchers might perform in traditional thematic analysis: reading text one unit of analysis at a time, considering the codes already in the codebook, and then deciding whether to generate a new code based on whether the existing codebook provides adequate thematic coverage. We demonstrate this workflow using three synthetic datasets from hypothetical organizational research settings: a study of teammate feedback in teamwork settings, a study of organizational cultures of ethical behavior, and a study of employee perspectives about returning to their offices after the pandemic. We show that the GATOS workflow is able to identify the themes that were used to generate the original synthetic datasets.
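
The inductive loop described in the abstract (read one unit, consult the existing codebook, decide whether a new code is needed) can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several loudly stated assumptions: a bag-of-words count vector stands in for a learned embedding, a retrieval step is assumed to fetch the most similar existing codes, a fixed similarity threshold stands in for the judgement that a generative model would make, and the unit's own text is reused as the code label instead of a generated one. The names `embed`, `cosine`, `build_codebook`, `threshold`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_codebook(units, threshold=0.5, top_k=3):
    """Read units one at a time; add a code only when coverage is inadequate."""
    codebook = []  # list of (code_text, vector) pairs
    for unit in units:
        vec = embed(unit)
        # Retrieval step (assumed): fetch the top_k most similar existing codes.
        scored = sorted(
            ((cosine(vec, code_vec), code_text) for code_text, code_vec in codebook),
            reverse=True,
        )[:top_k]
        # Decision step (hypothetical rule): a similarity threshold stands in
        # for the generative model's judgement about thematic coverage.
        covered = bool(scored) and scored[0][0] >= threshold
        if not covered:
            # Simplification: the unit text itself serves as the new code label.
            codebook.append((unit, vec))
    return [code_text for code_text, _ in codebook]


units = [
    "my teammate communicates clearly",
    "teammate communicates clearly and often",
    "deadlines were missed repeatedly",
]
codes = build_codebook(units)
```

In this toy run the second unit is judged to be covered by the code created from the first, so only the first and third units produce codes. Raising `threshold` makes the loop more willing to create new codes, which mirrors the trade-off between redundancy and coverage that the TL;DR mentions.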

Paper Structure (39 sections, 11 figures, 12 tables)

Introduction
Background
Qualitative Codebooks and Coding
Natural Language Processing and Machine Learning in Thematic Analysis
Method
Data Simulation
Simulated Dataset 1: Teammate feedback
Simulated Dataset 2: Organizational Cultures of Ethical Behavior
Simulated Dataset 3: Employee Perspectives About Returning to Their Workplaces After the Pandemic
GATOS Workflow Overview
GATOS Workflow in Detail
Step 1: Summarize the Original Data
Step 2: Clustering Semantically Similar Ideas
Step 3.1: Create Set of Speculative Starter Codes
Step 3.2: Inductive Codebook Generation
...and 24 more sections

Figures (11)

Figure 1: Data Generation Process
Figure 2: Distribution of Simulated Response Lengths for Teammate Feedback
Figure 3: Distribution of Response Lengths for Faculty Perspectives Upon Returning to Work After the Pandemic
Figure 4: Distribution of Response Lengths for Student Extracurricular Activity Participation
Figure 5: Workflow for the method
...and 6 more figures
