LLM for Comparative Narrative Analysis

Leo Kampen; Carlos Rabat Villarreal; Louis Yu; Santu Karmaker; Dongji Feng

TL;DR

This work formalizes Comparative Narrative Analysis (CNA) as a constrained multi-sequence task that fuses four narrative perspectives (Overlap, Conflict, Holistic, and Unique) into a single summary. The integration is defined in general form as $N_o = \mathcal{F}(O, C, H, U)$ and as a weighted linear combination, $N_o = \alpha O + \beta C + \gamma H + \delta U$. The study adopts the TELeR prompt taxonomy to design four prompt levels and evaluates three large language models (GPT-3.5, PaLM2, Llama2) on 5 narrative pairs derived from AllSides. With 4 subtasks and 4 prompt levels, this yields 240 summarizations and 24,000 human-rated data points. Prompt level 4 generally yields the best human-aligned performance, but the models diverge in strengths: GPT-3.5 excels at Unique, Llama2 excels at Conflict, and PaLM2 struggles with Overlap. The overall ranking is GPT-3.5 > Llama2 > PaLM2. These results highlight the need for consistent prompting and human evaluation in CNA tasks and point to TELeR as a practical framework for fair cross-model benchmarking, with richer datasets and automated metrics left to future work.
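The counts quoted in the TL;DR can be checked with a short sketch that enumerates the experimental grid. This is only an illustration under stated assumptions: the pair identifiers (`pair_1` to `pair_5`) are hypothetical placeholders, and numbering the prompt levels 1 through 4 is assumed from the reference to "prompt level 4"; neither is taken from the authors' code.

```python
from itertools import product

# Factors named in the TL;DR.
MODELS = ["GPT-3.5", "PaLM2", "Llama2"]
SUBTASKS = ["Overlap", "Conflict", "Holistic", "Unique"]

# Assumption: the four prompt levels are numbered 1..4.
PROMPT_LEVELS = [1, 2, 3, 4]

# Hypothetical placeholder ids for the 5 AllSides-derived narrative pairs.
NARRATIVE_PAIRS = [f"pair_{i}" for i in range(1, 6)]


def build_grid():
    """Return every (model, pair, subtask, prompt level) combination."""
    return list(product(MODELS, NARRATIVE_PAIRS, SUBTASKS, PROMPT_LEVELS))


grid = build_grid()
total_summaries = len(grid)

# 24,000 is the number of human-rated data points reported in the TL;DR.
REPORTED_DATA_POINTS = 24_000
points_per_summary = REPORTED_DATA_POINTS / total_summaries

print(f"summarizations: {total_summaries}")
print(f"average data points per summarization: {points_per_summary:.0f}")
```

The grid size matches the 240 summarizations reported (3 × 5 × 4 × 4). Dividing the reported 24,000 data points by 240 gives an average of 100 per summarization; how that figure splits across annotators and rating criteria is not stated here.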

Abstract

In this paper, we conducted a Multi-Perspective Comparative Narrative Analysis (CNA) on three prominent LLMs: GPT-3.5, PaLM2, and Llama2. We applied identical prompts to each model and evaluated the outputs on specific tasks, ensuring an equitable and unbiased comparison across the three LLMs. Our study revealed that the models generated divergent responses to the same prompt, indicating notable discrepancies in their ability to comprehend and analyze the given task. Human evaluation served as the gold standard and covered four perspectives to analyze differences in LLM performance.
