Counterfactual Story Reasoning and Generation
Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, Yejin Choi
TL;DR
This work introduces Counterfactual Story Rewriting as a testbed for causal narrative reasoning and presents TimeTravel, a dataset of nearly 30k counterfactual rewritings built on ROCStories. Baselines built on GPT and GPT-2 are trained under both unsupervised and supervised regimes. The results show that current language models pick up some counterfactual cues but struggle to keep the full narrative consistent, which points to a gap between surface language patterns and underlying causal reasoning. In human studies, larger models and ROCStories-tuned variants score better on consistency with the premise and plot, yet they remain inconsistent in adhering to the counterfactual change. The work proposes reconstruction and counterfactual training objectives to encourage minimal edits. It also finds that automatic metrics, especially overlap-based ones, do not adequately reflect counterfactual understanding, which motivates future research on causally informed generation and more discriminative evaluation.
Abstract
Counterfactual reasoning requires predicting how alternative events, contrary to what actually happened, might have resulted in different outcomes. Despite being considered a necessary component of AI-complete systems, few resources have been developed for evaluating counterfactual reasoning in narratives. In this paper, we propose Counterfactual Story Rewriting: given an original story and an intervening counterfactual event, the task is to minimally revise the story to make it compatible with the given counterfactual event. Solving this task will require deep understanding of causal narrative chains and counterfactual invariance, and integration of such story reasoning capabilities into conditional language generation models. We present TimeTravel, a new dataset of 29,849 counterfactual rewritings, each with the original story, a counterfactual event, and a human-generated revision of the original story compatible with the counterfactual event. Additionally, we include 80,115 counterfactual "branches" without a rewritten storyline to support future work on semi-supervised or unsupervised approaches to counterfactual story rewriting. Finally, we evaluate the counterfactual rewriting capacities of several competitive baselines based on pretrained language models, and assess whether common overlap and model-based automatic metrics for text generation correlate well with human scores for counterfactual rewriting.
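To make the task setup concrete, the following is a minimal Python sketch of one rewriting example, together with a simple token-overlap score. It is an illustration under stated assumptions, not the dataset's actual schema or the paper's evaluation code: the class `RewritingExample`, its field names, and the function `token_f1` are hypothetical, the story is an invented toy example rather than a TimeTravel entry, and unigram F1 is only a simple stand-in for overlap-based metrics. The sketch also assumes that the rewritten story contains the counterfactual event in place of the sentence it replaces.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RewritingExample:
    """Hypothetical record layout; field names are illustrative only."""
    original_story: list[str]
    counterfactual_event: str
    rewritten_story: list[str]


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented toy story, not taken from TimeTravel.
example = RewritingExample(
    original_story=[
        "Anna wanted to bake a cake.",
        "She bought flour and eggs at the store.",
        "She mixed the batter at home.",
        "She baked the cake in the oven.",
        "Her family loved the cake.",
    ],
    counterfactual_event="She found the store was closed.",
    rewritten_story=[
        "Anna wanted to bake a cake.",
        "She found the store was closed.",
        "She borrowed flour and eggs from a neighbor.",
        "She baked the cake in the oven.",
        "Her family loved the cake.",
    ],
)

reference = " ".join(example.rewritten_story)
# A degenerate "rewrite" that ignores the counterfactual and copies the original.
unchanged_copy = " ".join(example.original_story)

copy_score = token_f1(unchanged_copy, reference)
gold_score = token_f1(reference, reference)
print(f"copy of original: {copy_score:.2f}, reference vs. itself: {gold_score:.2f}")
```

Because the revision is meant to be minimal, a copy of the original story that ignores the counterfactual event still shares most of its tokens with the reference rewrite. In this toy setting, that is one way to see why an overlap score alone may not separate a faithful rewrite from an unchanged story.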
