SAGA: A Participant-specific Examination of Story Alternatives and Goal Applicability for a Deeper Understanding of Complex Events
Sai Vallurupalli, Katrin Erk, Francis Ferraro
TL;DR
SAGA introduces a participant-centric framework for goal reasoning in complex narratives, building a large, crowd-sourced dataset of 6.3K annotated goal and action instances across 886 actual and 951 alternative ROCStories-derived narratives, with an average inter-annotator agreement (IAA) of 80%. The authors formulate five inference tasks, ranging from goal inference to planning and goal achievement, to benchmark multiple LLMs (GPT-4/3.5, Flan-T5, T5) under zero- and few-shot settings and with fine-tuning on SAGA data. Results show that larger pretrained models generally outperform smaller ones, but fine-tuned smaller models can match or surpass larger models on several tasks, particularly when augmented with few-shot prompts. The work demonstrates that while large models capture some goal-based knowledge, finer-grained, participant-specific goal reasoning and prospective planning benefit substantially from curated datasets and targeted fine-tuning, with implications for claim verification, question answering, and narrative understanding.
Abstract
Interpreting and assessing goal-driven actions is vital to understanding and reasoning over complex events. It is important to be able to acquire the knowledge needed for this understanding, though doing so is challenging. We argue that such knowledge can be elicited through a participant achievement lens. We analyze a complex event in a narrative according to the intended achievements of the participants in that narrative, the likely future actions of the participants, and the likelihood of goal success. We collect 6.3K high-quality goal and action annotations reflecting our proposed participant achievement lens, with an average weighted Fleiss-Kappa IAA of 80%. Our collection contains annotated alternate versions of each narrative. These alternate versions vary minimally from the "original" story, but can license drastically different inferences. Our findings suggest that while modern large language models can reflect some of the goal-based knowledge we study, they find it challenging to fully capture the design and intent behind concerted actions, even when the model pretraining included the data from which we extracted the goal knowledge. We show that smaller models fine-tuned on our dataset can achieve performance surpassing larger models.
