Framework-Based Qualitative Analysis of Free Responses of Large Language Models: Algorithmic Fidelity

Aliya Amirova, Theodora Fteropoulli, Nafiso Ahmed, Martin R. Cowie, Joel Z. Leibo

TL;DR

This paper stresses the need to establish epistemic norms now for assessing the validity of LLM-based qualitative research, particularly regarding the representation of heterogeneous lived experiences.

Abstract

Today, using Large-scale generative Language Models (LLMs) it is possible to simulate free responses to interview questions like those traditionally analyzed using qualitative research methods. Qualitative methodology encompasses a broad family of techniques involving manual analysis of open-ended interviews or conversations conducted freely in natural language. Here we consider whether artificial "silicon participants" generated by LLMs may be productively studied using qualitative methods aiming to produce insights that could generalize to real human populations. The key concept in our analysis is algorithmic fidelity, a term introduced by Argyle et al. (2023) capturing the degree to which LLM-generated outputs mirror human sub-populations' beliefs and attitudes. By definition, high algorithmic fidelity suggests latent beliefs elicited from LLMs may generalize to real humans, whereas low algorithmic fidelity renders such research invalid. Here we used an LLM to generate interviews with silicon participants matching specific demographic characteristics one-for-one with a set of human participants. Using framework-based qualitative analysis, we showed the key themes obtained from both human and silicon participants were strikingly similar. However, when we analyzed the structure and tone of the interviews we found even more striking differences. We also found evidence of the hyper-accuracy distortion described by Aher et al. (2023). We conclude that the LLM we tested (GPT-3.5) does not have sufficient algorithmic fidelity to expect research on it to generalize to human populations. However, the rapid pace of LLM research makes it plausible this could change in the future. Thus we stress the need to establish epistemic norms now around how to assess validity of LLM-based qualitative research, especially concerning the need to ensure representation of heterogeneous lived experiences.

Paper Structure (44 sections, 5 figures)

This paper contains 44 sections, 5 figures.

Figures (5)

  • Figure 1: Schematic representation of autoregressive sampling in large language models (LLMs). The diagram illustrates the iterative sampling process in three stages: input, processing by the LLM, and output. The LLM represents the probability distribution over all possible next words given the current context (previous words). Arrows indicate the flow of information, with solid arrows representing the transition from one stage to another within a single iteration, and dotted arrows indicating the progression from one iteration to the next. The outputted words ("a", "time", "there") are samples from the corresponding probability distributions and are appended to the context for the next iteration.
  • Figure 2: Mean quote fractions for human (green) and silicon (amber) participants across TDF domains mentioned as positively influencing physical activity (i.e., physical activity enablers), grouped by active (top) and sedentary (bottom) status. TDF domains ordered by mean quote fraction: (1) Beliefs about Consequences (BCon), (2) Behavioural Regulation (BR), (3) Social influences (SI), (4) Goals (Gs), (5) Environmental Context and Resources (ECR), (6) Reinforcement (Rnfrt), (7) Optimism (Optm), (8) Social/Professional Role and Identity (SPR), (9) Emotion (Emtns), (10) Beliefs about Capabilities (BCap), (11) Knowledge (Knls), (12) Skills (Skls), (13) Intentions (Is), (14) Memory, Attention and Decision Processes (MADP). $*p<0.05; **p<0.01; ***p<0.005; ****p<0.001$.
  • Figure 3: Mean quote fractions for human (green) and silicon (amber) participants across TDF domains mentioned as negatively influencing physical activity (i.e., physical activity barriers), grouped by active (top) and sedentary (bottom) status. TDF domains ordered by mean quote fraction: (1) Beliefs about Capabilities (BCap), (2) Beliefs about Consequences (BCon), (3) Environmental Context and Resources (ECR), (4) Goals (Gs), (5) Memory, Attention and Decision Processes (MADP), (6) Emotion (Emtns), (7) Skills (Skls), (8) Behavioural Regulation (BR), (9) Social/Professional Role and Identity (SPR), (10) Social influences (SI), (11) Optimism (Optm), (12) Knowledge (Knls), (13) Reinforcement (Rnfrt), (14) Intentions (Is). $*p<0.05; **p<0.01; ***p<0.005; ****p<0.001$.
  • Figure 4: Mean quote fractions for active (red) and sedentary (blue) human participants (top) and active and sedentary silicon participants (bottom) across TDF domains mentioned as positively influencing physical activity (i.e., physical activity enablers). TDF domains ordered by mean quote fraction: (1) Beliefs about Consequences (BCon), (2) Behavioural Regulation (BR), (3) Social influences (SI), (4) Goals (Gs), (5) Environmental Context and Resources (ECR), (6) Reinforcement (Rnfrt), (7) Optimism (Optm), (8) Social/Professional Role and Identity (SPR), (9) Emotion (Emtns), (10) Beliefs about Capabilities (BCap), (11) Knowledge (Knls), (12) Skills (Skls), (13) Intentions (Is), (14) Memory, Attention and Decision Processes (MADP). $*p<0.05; **p<0.01; ***p<0.005; ****p<0.001$.
  • Figure 5: Mean quote fractions for active (red) and sedentary (blue) human participants (top) and active and sedentary silicon participants (bottom) across TDF domains mentioned as negatively influencing physical activity (i.e., physical activity barriers). TDF domains ordered by mean quote fraction: (1) Beliefs about Capabilities (BCap), (2) Beliefs about Consequences (BCon), (3) Environmental Context and Resources (ECR), (4) Goals (Gs), (5) Memory, Attention and Decision Processes (MADP), (6) Emotion (Emtns), (7) Skills (Skls), (8) Behavioural Regulation (BR), (9) Social/Professional Role and Identity (SPR), (10) Social influences (SI), (11) Optimism (Optm), (12) Knowledge (Knls), (13) Reinforcement (Rnfrt), (14) Intentions (Is). $*p<0.05; **p<0.01; ***p<0.005; ****p<0.001$.
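
The autoregressive sampling loop described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the paper's code or a real LLM: the lookup table `TOY_NEXT_WORD_PROBS`, the helper `next_word_distribution`, and the prompt "Once upon" are hypothetical stand-ins chosen for illustration. A real model would condition on the whole context rather than only the last word.

```python
import random

# Hypothetical toy "model": maps the last word of the context to a
# probability distribution over next words. A real LLM conditions on the
# entire context; this table is purely illustrative.
TOY_NEXT_WORD_PROBS = {
    "Once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a": {"time": 0.9, "hill": 0.1},
    "time": {"there": 0.8, "ago": 0.2},
    "there": {"was": 1.0},
}


def next_word_distribution(context):
    """Return P(next word | context) for the toy model (assumed helper)."""
    return TOY_NEXT_WORD_PROBS.get(context[-1], {"<end>": 1.0})


def sample_autoregressively(prompt, max_new_words, rng):
    """Sample one word at a time, appending each sample to the context."""
    context = list(prompt)
    for _ in range(max_new_words):
        dist = next_word_distribution(context)
        words = list(dist.keys())
        weights = list(dist.values())
        # Draw one word from the current next-word distribution.
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == "<end>":
            break
        # The sampled word becomes part of the context for the next step.
        context.append(word)
    return context


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same draw
    print(" ".join(sample_autoregressively(["Once", "upon"], 3, rng)))
```

Each pass through the loop corresponds to one iteration in the figure: the context goes in, a word is drawn from the resulting distribution, and the extended context feeds the next iteration.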