Collecting Qualitative Data at Scale with Large Language Models: A Case Study

Alejandro Cuevas, Jennifer V. Scurrell, Eva M. Brown, Jason Entenmann, Madeleine I. G. Daepp

TL;DR

This study empirically evaluates two LLM-based chatbot modules (Dynamic Prober and Member Checker) against a baseline in a large-scale qualitative data collection task on AI alignment, and introduces a novel richness framework encompassing cognitive empathy and palpability. Although the LLM-based chatbots achieve high scores on traditional quality metrics and better engagement than a naive baseline, they fail to elicit rich, personalized data comparable to human interviews, revealing a persistent richness gap. The authors also find that LLMs, even GPT-4, code qualitative data with limited reliability, which highlights the need for human-in-the-loop oversight and critical evaluation of AI-assisted methods. The work cautions researchers about the current limits of automated qualitative interviewing, while offering open-source tools and guiding principles for more robust future development and evaluation.

Abstract

Chatbots have shown promise as tools to scale qualitative data collection. Recent advances in Large Language Models (LLMs) could accelerate this process by allowing researchers to easily deploy sophisticated interviewing chatbots. We test this assumption by conducting a large-scale user study (n=399) evaluating 3 different chatbots, two of which are LLM-based and a baseline which employs hard-coded questions. We evaluate the results with respect to participant engagement and experience, established metrics of chatbot quality grounded in theories of effective communication, and a novel scale evaluating "richness" or the extent to which responses capture the complexity and specificity of the social context under study. We find that, while the chatbots were able to elicit high-quality responses based on established evaluation metrics, the responses rarely capture participants' specific motives or personalized examples, and thus perform poorly with respect to richness. We further find low inter-rater reliability between LLMs and humans in the assessment of both quality and richness metrics. Our study offers a cautionary tale for scaling and evaluating qualitative research with LLMs.
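The abstract reports low inter-rater reliability between LLM and human coders but does not say here which agreement statistic was used. As an illustration only, the following minimal sketch computes Cohen's kappa, one common chance-corrected agreement measure for two raters. The function name `cohen_kappa` and the example labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Illustrative only: the paper's actual reliability statistic is not
    stated in this excerpt.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement under independence, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary codes (1 = indicator present) for four response segments.
human_codes = [1, 1, 0, 0]
llm_codes = [1, 0, 0, 0]
kappa = cohen_kappa(human_codes, llm_codes)
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance, which is the sense in which reliability between two coders would be called "low".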

Paper Structure

This paper contains 38 sections, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Chatbot User Interface. Generated text is indicated in grey, with questions made more salient to the user through yellow highlighting. User text is indicated in blue. The image presents an example from a dialogue in which (1) the chatbot first showed a question from an established, validated measurement scale [jakesch2022different], modified to use less formal language [kim2019], and then (2) accessed the "Dynamic Prober" module to generate a follow-up question based on the user's response. After the user responded to three questions and their associated follow-up probes, (3) shows the "Member Checker" module, in which the chatbot generated a summary of the conversation and asked the user to confirm the summary content.
  • Figure 2: Experimental Design. Participants began the study by answering questions on AI alignment. They were then evenly randomized across three groups and taken to the Web-based chatbot interface. Group (1) was given hard-coded questions. Group (2) received the "Dynamic Prober" module. Group (3) interacted with the "Dynamic Prober" and "Member Checker" modules. All participants were then asked to complete questions on their experience and demographics.
  • Figure 3: Participant ratings of their experience with the chatbot on an 11-point Likert scale from "very dissatisfied" to "very satisfied."
  • Figure 4: Participant ratings of their experience with the chatbot in comparison with a survey (left panel) or a human interviewer (right panel) on an 11-point Likert scale.
  • Figure 5: Overview of codes for each indicator described in Table \ref{tab:indicator-defs} across study groups. Each group contains 39 coded segments. Apart from the "follow-up" indicator, no significant differences are observed across groups. Informativeness is omitted because it is measured in bits.
  • ...and 2 more figures