GPT-4 Generated Narratives of Life Events using a Structured Narrative Prompt: A Validation Study

Christopher J. Lynch; Erik Jensen; Madison H. Munro; Virginia Zamponi; Joseph Martinez; Kevin O'Brien; Brandon Feldhaus; Katherine Smith; Ann Marie Reinhold; Ross Gore

TL;DR

This study evaluates GPT-4 narratives of life events generated from a structured narrative prompt (SNP) and validates them through manual tagging and nine machine-learning classifiers. The authors generate 24,000 narratives across birth, death, hiring, and firing events, manually label a sample of 2,880, and find that 87.43% of the labeled narratives convey the intent of the prompt. Nine ML models trained on the labeled sample then predict classifications for the remaining 21,120 narratives. GPT-4 outputs align closely with the structured prompt, though validity varies by event type; the ML models reliably identify valid narratives but struggle with invalid ones due to data imbalance. The workflow offers a scalable framework for automated evaluation and refinement of LLM-generated narratives, with implications for narrative generation, health and science communication, and NLP applications that require transparent, prompt-driven outputs.

Abstract

Large Language Models (LLMs) play a pivotal role in generating vast arrays of narratives, facilitating a systematic exploration of their effectiveness for communicating life events in narrative form. In this study, we employ a zero-shot structured narrative prompt to generate 24,000 narratives using OpenAI's GPT-4. From this dataset, we manually classify 2,880 narratives and evaluate their validity in conveying birth, death, hiring, and firing events. Remarkably, 87.43% of the manually classified narratives sufficiently convey the intention of the structured prompt. To automate the identification of valid and invalid narratives, we train and validate nine Machine Learning (ML) models on the classified datasets. Leveraging these models, we extend our analysis to predict the classifications of the remaining 21,120 narratives. All the ML models excelled at classifying valid narratives as valid but had difficulty simultaneously classifying invalid narratives as invalid. Our findings not only advance the study of LLM capabilities, limitations, and validity but also offer practical insights for narrative generation and natural language processing applications.
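
The counts in the abstract can be checked with a few lines of arithmetic, which also hint at why invalid narratives are harder to catch. The sketch below is illustrative only: the per-class counts are rounded from the reported 87.43% rate rather than taken from the paper, and the "always valid" baseline is a hypothetical reference point, not one of the nine models.

```python
# Counts stated in the abstract.
TOTAL_NARRATIVES = 24_000
MANUALLY_TAGGED = 2_880
VALID_RATE = 0.8743  # reported share of tagged narratives judged valid

# Narratives left for the trained ML models to classify.
remaining = TOTAL_NARRATIVES - MANUALLY_TAGGED

# Approximate class counts implied by the reported rate (assumption:
# rounded from the percentage; exact counts are not given here).
approx_valid = round(MANUALLY_TAGGED * VALID_RATE)
approx_invalid = MANUALLY_TAGGED - approx_valid

# Hypothetical baseline that labels every narrative "valid".
baseline_accuracy = approx_valid / MANUALLY_TAGGED
baseline_valid_recall = 1.0    # every valid narrative is labeled valid
baseline_invalid_recall = 0.0  # no invalid narrative is ever flagged

print(f"remaining to predict: {remaining}")
print(f"approx. valid / invalid: {approx_valid} / {approx_invalid}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

With roughly seven valid narratives for every invalid one, a classifier can score high accuracy while rarely flagging an invalid narrative, so per-class metrics are more informative than accuracy alone.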

Paper Structure (18 sections, 24 figures, 1 table)

Introduction
Related Work
Experimentation
Preparing the Data
Manual Data Tagging
Model Generation using Tagged Data
Model Prediction on Remaining Untagged Data
Results
Structured Narrative Prompt Validity
ML Model Validity
ML Model Prediction Validity
Timing Considerations for ML Building and Predicting
Study Limitations
Conclusion
Agreement matrices of ML models' predicted classifications.
...and 3 more sections

Figures (24)

Figure 1: Research methodology. GPT-4 is prompted using an existing structured narrative prompt to produce 24,000 narratives across 4 life event types. These events undergo manual tagging to determine if each narrative meets the intention of its prompt. The tagged narratives are used to train and validate nine ML models. The validated ML models are then utilized to predict the classifications on the remaining 21,120 narratives.
Figure 2: Sample Structured Narrative Prompt setup utilized to prompt the LLM. Each prompt sent to the LLM contains unique information in fields 2-5 as they pertain to the subject and narrator characteristics.
Figure 3: Sample user interface during manual data tagging. Narratives start untagged and reviewers independently provide binary Yes/No tags for each narrative in their respective sets. Ties among reviewers are broken by an independent third party.
Figure 4: Fisher's exact test results on the confusion matrices for each model. P-values of 1.0 occur in cases where 0 No classifications occurred. Results are grouped by event type, (a) Birth event narratives, (b) Death event narratives, (c) Hired event narratives, and (d) Fired event narratives.
Figure 5: ML models' binary Yes (blue) / No (red) classification precision for (a) Birth event narratives, (b) Death event narratives, (c) Hired event narratives, and (d) Fired event narratives.
...and 19 more figures
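
Figures 4 and 5 refer to Fisher's exact test and per-class precision computed from each model's confusion matrix. A minimal sketch of both calculations follows, using only the Python standard library. The function names, the table layout, and the example counts are assumptions made for illustration; they are not the authors' code or data.

```python
from math import comb


def per_class_precision(tp, fp, fn, tn):
    """Return (Yes precision, No precision).

    Assumed layout: tp = true Yes predicted Yes, fp = true No predicted Yes,
    fn = true Yes predicted No, tn = true No predicted No.
    """
    yes_precision = tp / (tp + fp) if (tp + fp) else float("nan")
    no_precision = tn / (tn + fn) if (tn + fn) else float("nan")
    return yes_precision, no_precision


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Hypothetical counts: rows are the manual tags, columns the model predictions.
tp, fn, fp, tn = 120, 5, 10, 9
yes_prec, no_prec = per_class_precision(tp, fp, fn, tn)
p_value = fisher_exact_two_sided(tp, fn, fp, tn)
print(f"Yes precision={yes_prec:.3f}, No precision={no_prec:.3f}, p={p_value:.3g}")

# A table with an empty No column admits only one arrangement, so p is 1.0.
p_degenerate = fisher_exact_two_sided(130, 0, 14, 0)
```

The degenerate case at the end is consistent with the Figure 4 caption, which notes p-values of 1.0 when no No classifications occur.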
