Counterfactual Story Reasoning and Generation

Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, Yejin Choi

TL;DR

This work introduces Counterfactual Story Rewriting as a testbed for causal narrative reasoning and presents TimeTravel, a dataset of nearly 30k counterfactual rewritings built on ROCStories. Using unsupervised and supervised training regimes with GPT and GPT-2 models, the study finds that current language models capture some counterfactual cues but struggle to keep the full narrative consistent, which points to a gap between surface language patterns and underlying causal reasoning. In human studies, larger models and ROCStories-tuned variants do better on premise and plot but remain inconsistent in adhering to the counterfactual change. The work proposes reconstruction and counterfactual objectives to encourage minimal edits, and finds that automatic metrics, especially overlap-based ones, poorly reflect counterfactual understanding, which motivates future research on causally informed generation and more discriminative evaluation.

Abstract

Counterfactual reasoning requires predicting how alternative events, contrary to what actually happened, might have resulted in different outcomes. Despite being considered a necessary component of AI-complete systems, few resources have been developed for evaluating counterfactual reasoning in narratives. In this paper, we propose Counterfactual Story Rewriting: given an original story and an intervening counterfactual event, the task is to minimally revise the story to make it compatible with the given counterfactual event. Solving this task will require deep understanding of causal narrative chains and counterfactual invariance, and integration of such story reasoning capabilities into conditional language generation models. We present TimeTravel, a new dataset of 29,849 counterfactual rewritings, each with the original story, a counterfactual event, and a human-generated revision of the original story compatible with the counterfactual event. Additionally, we include 80,115 counterfactual "branches" without a rewritten storyline to support future work on semi-supervised or unsupervised approaches to counterfactual story rewriting. Finally, we evaluate the counterfactual rewriting capacities of several competitive baselines based on pretrained language models, and assess whether common overlap and model-based automatic metrics for text generation correlate well with human scores for counterfactual rewriting.

Paper Structure

This paper contains 38 sections, 5 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Given a short story (left column) and a counterfactual context ("He decided to be a werewolf this year"), the task is to revise the original story with minimal edits to be consistent with both the original premise ("Pierre loved Halloween") and the new counterfactual situation. The modified parts in the new story (right column) are highlighted in red.
  • Figure 2: Data annotation process for the TimeTravel dataset. Given a story from the ROCStories corpus, crowdworkers write a counterfactual sentence with respect to the second sentence of the story. The counterfactual sentence and the original story are then presented to other workers, who rewrite the story ending. Models for the task are expected to generate a rewritten ending given the original story and the counterfactual sentence.
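
The instance layout described in the Figure 2 caption can be sketched as a small Python record. This is only an illustrative sketch, not the dataset's actual schema: the class name, the field names, and the `[CF]` separator are all hypothetical, and it assumes that the "ending" is everything after the second sentence of the story.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RewritingExample:
    """One counterfactual rewriting instance (all field names are hypothetical)."""
    premise: str                # first sentence of the original story
    initial: str                # second sentence of the original story
    counterfactual: str         # worker-written alternative to the second sentence
    original_ending: List[str]  # remaining sentences of the original story
    edited_ending: List[str]    # human rewrite of the ending

    def model_input(self) -> str:
        """What a model conditions on: the original story plus the counterfactual.

        The " [CF] " separator is an assumption made for this sketch.
        """
        story = [self.premise, self.initial, *self.original_ending]
        return " ".join(story) + " [CF] " + self.counterfactual

    def target(self) -> str:
        """What a model is expected to generate: the rewritten ending."""
        return " ".join(self.edited_ending)


# The premise and counterfactual are quoted in the Figure 1 caption;
# the other strings are placeholders, not taken from the dataset.
example = RewritingExample(
    premise="Pierre loved Halloween.",
    initial="<original second sentence>",
    counterfactual="He decided to be a werewolf this year.",
    original_ending=["<original ending sentence>"],
    edited_ending=["<rewritten ending sentence>"],
)
```

Keeping the original ending next to the edited ending makes it easy to check how minimal a rewrite is, since the task asks for revisions that change only what the counterfactual requires.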