Measuring Psychological Depth in Language Models

Fabrice Harel-Canada; Hanyu Zhou; Sreya Muppalla; Zeynep Yildiz; Miryung Kim; Amit Sahai; Nanyun Peng

TL;DR

The paper introduces the Psychological Depth Scale (PDS), a reader-centered framework grounded in reader-response criticism and text-world theory that quantifies psychological depth across five metacomponents: empathy, engagement, emotion provocation, authenticity, and narrative complexity. It validates PDS through a human study with substantial inter-rater agreement (Krippendorff's $\alpha \approx 0.72$) and demonstrates automated depth assessment using Mixture-of-Personas prompting, with GPT-4o achieving an average Spearman correlation of $\approx 0.51$ with human judgments and Llama-3-70B reaching $0.68$ for empathy and $0.62$ for narrative complexity. The study also compares human-written and LLM-generated stories: GPT-4 often matches or surpasses human-written stories from Reddit on several depth components, and around $73\%$ of readers perceive its stories as human-written. Overall, PDS provides a validated, scalable means of measuring how effectively stories, whether human or machine authored, connect with readers, with implications for evaluation, prompting strategies, and future research into deeper AI-assisted storytelling.
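
To make the agreement figure more concrete, the following is a minimal sketch of how Krippendorff's alpha can be computed for a set of story ratings. It assumes the interval distance metric (squared differences between ratings); the summary here does not say which metric the authors used, so this choice, the function name `krippendorff_alpha_interval`, and the toy ratings are all illustrative assumptions rather than the paper's procedure.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric (an assumed choice).

    units: list of lists; each inner list holds the ratings that one
    story received from different annotators. Missing ratings are
    simply left out of the inner list.
    """
    # A story with fewer than two ratings contributes no pairs.
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences within each story,
    # weighted by 1 / (ratings in that story - 1), averaged over n.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: squared differences over all ordered
    # pairs of ratings, regardless of which story they belong to.
    expected = sum(
        (a - b) ** 2 for a, b in permutations(values, 2)
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("all ratings are identical; alpha is undefined")

    return 1 - observed / expected


# Hypothetical ratings: two stories, two annotators each.
toy_ratings = [[1, 2], [3, 3]]
alpha = krippendorff_alpha_interval(toy_ratings)
```

An alpha of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale on which the reported value of about 0.72 should be read.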

Abstract

Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and diversity. While these metrics are indispensable, they do not speak to a story's subjective, psychological impact from a reader's perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM's ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff's alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of 0.51 with human judgment while Llama-3-70B with constrained decoding scores as high as 0.68 for empathy. Finally, we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly-rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.
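
As an illustration of the automated-evaluation metric mentioned above, here is a minimal sketch of a Spearman correlation between human and LLM ratings of the same stories. The helper names (`average_ranks`, `spearman`) and the ratings are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical 1-5 empathy ratings for five stories.
human_scores = [1, 2, 3, 4, 5]
llm_scores = [2, 1, 4, 3, 5]
rho = spearman(human_scores, llm_scores)
```

Because the measure depends only on rank order, an LLM judge can score well even if its absolute ratings are systematically higher or lower than the human ones. An average figure such as the reported 0.51 would then be the mean of this value across the five components.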

Paper Structure (31 sections, 9 figures, 10 tables)

Introduction
The Psychological Depth Scale
PsychDepth Dataset
Human Stories
LLM Stories
Prompting Strategies
Human Study
Results
RQ1. Consistency of Human Judgments
RQ2. LLM-as-Judge for Measuring Psychological Depth
RQ3. Comparing Psychological Depth in Human and LLM Stories
Discussion
Related Work
Evaluating Creative Writing
Creative Generation by LLMs
...and 16 more sections

Figures (9)

Figure 1: In this GPT-4 story, the psychological depth scale highlights strengths and weaknesses contributing to the overall reader experience, providing additional quality signals over traditional metrics more likely to saturate. Scores are normalized 1-5 for comparison.
Figure 2: Overview of our approach to developing and validating the Psychological Depth Scale. We merge related metrics from an extensive survey of literary theory and reader-response analysis, then generate deep stories using LLMs, and finally compare annotations from both human evaluators and automated systems across five key dimensions: authenticity, narrative complexity, empathy, engagement, and emotion provocation.
Figure 3: Illustration of WriterProfile's template, which prompts an LLM to generate stories based on a premise and a writer profile.
Figure 4: Illustration of Plan+Write's workflow, which prompts an LLM for character portraits given a premise prior to story generation.
Figure 5: Screenshots taken from the Warm-Up tutorial instructions shown to study participants. All fields are similar to the ones used in the main annotation forms.
...and 4 more figures
