SWAG: Storytelling With Action Guidance

Zeeshan Patel, Karim El-Refai, Jonathan Pei, Tianle Li

TL;DR

SWAG addresses the challenge of engaging long-form storytelling by introducing a two-model feedback loop in which an action-discriminator LLM guides narrative direction and a story-generation model writes content. The AD LLM is trained via supervised fine-tuning and direct preference optimization on a GPT-4–generated action dataset, and the SWAG loop alternates between content generation and action selection to produce longer, more engaging narratives. Empirical results from both machine and human evaluations show SWAG substantially outperforms end-to-end generation baselines and can even surpass GPT-3.5-Turbo in several setups, while remaining compatible with open-source models through LongLoRA-based long-context fine-tuning. The proposed framework is modular and extensible, enabling fine-grained control over story progression and potential extensions such as test-time action generation or human-in-the-loop collaboration for diverse storytelling applications.

Abstract

Automated long-form story generation typically employs long-context large language models (LLMs) for one-shot creation, which can produce cohesive but not necessarily engaging content. We introduce Storytelling With Action Guidance (SWAG), a novel approach to storytelling with LLMs. Our approach frames story writing as a search problem through a two-model feedback loop: one LLM generates story content, and another auxiliary LLM is used to choose the next best "action" to steer the story's future direction. Our results show that SWAG can substantially outperform previous end-to-end story generation techniques when evaluated by GPT-4 and through human evaluation. Our SWAG pipeline using only small open-source models surpasses GPT-3.5-Turbo.
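The two-model feedback loop described above can be sketched in a few lines. This is only a minimal illustration, not the authors' implementation: the function names (`swag_loop`, `toy_writer`, `toy_discriminator`), the fixed `max_steps` stopping rule, and the "add mystery" action are assumptions made for the example, and the two models are replaced by toy stand-in functions so the control flow can be run without any LLM.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interfaces: a story model writes one paragraph given the prompt,
# the story so far, and an optional action; an action discriminator picks the
# next action given the prompt, the story so far, and the candidate actions.
StoryModel = Callable[[str, List[str], Optional[str]], str]
ActionModel = Callable[[str, List[str], Sequence[str]], str]


def swag_loop(
    prompt: str,
    generate_paragraph: StoryModel,
    choose_action: ActionModel,
    actions: Sequence[str],
    max_steps: int = 5,  # assumed stopping rule, chosen for illustration
) -> Tuple[List[str], List[str]]:
    """Alternate between action selection and content generation."""
    # The initial paragraph is written from the prompt alone (no action yet).
    story = [generate_paragraph(prompt, [], None)]
    chosen: List[str] = []
    for _ in range(max_steps):
        # The discriminator sees the current story state and picks an action.
        action = choose_action(prompt, story, actions)
        chosen.append(action)
        # The story model continues the story, steered by that action.
        story.append(generate_paragraph(prompt, story, action))
    return story, chosen


# Toy stand-ins so the sketch runs without any model.
def toy_writer(prompt: str, story: List[str], action: Optional[str]) -> str:
    if action is None:
        return f"Opening for: {prompt}"
    return f"Paragraph {len(story)} ({action})"


def toy_discriminator(prompt: str, story: List[str], actions: Sequence[str]) -> str:
    # Cycles through the candidate actions; a real discriminator would score them.
    return actions[len(story) % len(actions)]


story, chosen = swag_loop(
    "a lighthouse keeper",
    toy_writer,
    toy_discriminator,
    ["add suspense", "add mystery"],  # "add mystery" is a hypothetical action
    max_steps=3,
)
```

Keeping the two models behind plain callables mirrors the modularity the paper emphasizes: either component could be swapped for a different model, or for a human choosing the action, without changing the loop.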

Paper Structure

This paper contains 39 sections, 3 equations, 8 figures, 10 tables, 1 algorithm.

Figures (8)

  • Figure 1: SWAG AD LLM Training Pipeline. After curating long story and action preference data from GPT-4, we perform SFT on a base open-source LLM, and then align our model with more preference data using DPO to produce our action discriminator model (AD LLM).
  • Figure 2: Original Distribution of Actions. We observe a severe distribution imbalance in which the vast majority of selected actions are "add suspense". Note: actions chosen with frequency less than 100 are not shown.
  • Figure 3: Rebalanced Distribution of Actions. After our rebalancing procedure, we observe a more uniform distribution among the top 5 actions chosen. Note: actions chosen with frequency less than 100 not shown.
  • Figure 4: SWAG Inference Loop. After sampling a story prompt and generating the initial paragraph, we pass the story state to our AD LLM to generate the next story action. The new state is passed back to the story model, and the process is repeated until a complete story is generated.
  • Figure 5: Comparing SWAG with only Llama-2-7B or Mistral-7B (as both AD and story generator) against GPT-3.5-Turbo E2E on human evaluation data. The win rate is the mean score over all comparisons, counting a win as 1, a tie as 0.5, and a loss as 0. Notably, SWAG with smaller open-source models outperforms the larger GPT-3.5 model.
  • ...and 3 more figures
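
The win-rate scoring described in the Figure 5 caption (win = 1, tie = 0.5, loss = 0, averaged over comparisons) can be written out as a short sketch. The function name `win_rate` and the example outcomes are illustrative assumptions, not data from the paper.

```python
from typing import Sequence

# Scores per outcome, as stated in the Figure 5 caption.
SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate(outcomes: Sequence[str]) -> float:
    """Mean score over pairwise comparisons labeled 'win', 'tie', or 'loss'."""
    if not outcomes:
        raise ValueError("at least one comparison is required")
    return sum(SCORES[o] for o in outcomes) / len(outcomes)


# Made-up example: two wins, one tie, one loss.
example = win_rate(["win", "win", "tie", "loss"])  # (1 + 1 + 0.5 + 0) / 4 = 0.625
```

Under this scoring, a win rate above 0.5 means the system was preferred more often than its opponent, with ties counting as half a win for each side.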