
Text-to-Stage: Spatial Layouts from Long-form Narratives

Jefferson Hernandez, Swarnadeep Saha, Chenxi Whitehouse, Sanjeel Parekh, Calvin Murdock, Yuliang Li, W. Owen Brimijoin, Vamsi Krishna Ithapu, Ishwarya Ananthabhotla

Abstract

In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications. Concretely, we study the narrative-to-play task: inferring stage-play layouts (scenes, speaker positions, movements, and room types) from text that lacks explicit spatial, positional, or relational cues. We then introduce a dramaturgy-inspired deterministic evaluation suite and, finally, a training and inference recipe that combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO. Experiments on a text-only corpus of classical English literature demonstrate improvements over vanilla models across multiple metrics (character attribution, spatial plausibility, and movement economy), as well as alignment with an LLM-as-a-judge and subjective human preferences.
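The rejection-SFT step named in the abstract can be illustrated with a minimal sketch: sample N candidate layouts for a passage, score each with a deterministic evaluator, and keep the best one only if it clears a quality bar. Everything in the block is a hypothetical stand-in rather than the authors' implementation: `toy_generate` replaces the teacher LLM, `toy_score` replaces the dramaturgy evaluator, and the `min_score` threshold is an assumption not stated in the abstract.

```python
from typing import Callable, Optional, Tuple


def best_of_n(
    passage: str,
    generate: Callable[[str, int], str],
    score: Callable[[str], float],
    n: int = 4,
    min_score: float = 0.5,
) -> Optional[Tuple[str, float]]:
    """Sample n candidates, return the top-scoring one, or None if rejected."""
    candidates = [generate(passage, i) for i in range(n)]
    scored = [(c, score(c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1])
    # Rejection step: discard the passage if even the best candidate is weak.
    return best if best[1] >= min_score else None


# Hypothetical stand-in for a teacher model: picks a grid cell by sample index.
def toy_generate(passage: str, seed: int) -> str:
    positions = ["front-left", "middle-center", "back-right", "offstage"]
    return f"{passage} -> {positions[seed % len(positions)]}"


# Hypothetical stand-in for the deterministic evaluator.
def toy_score(layout: str) -> float:
    if layout.endswith("offstage"):
        return 0.0
    return 1.0 if "center" in layout else 0.6


kept = best_of_n("ALICE: Hello.", toy_generate, toy_score, n=4)
rejected = best_of_n("ALICE: Hello.", toy_generate, toy_score, n=1, min_score=0.9)
```

In a pipeline of this shape, the surviving (passage, layout) pairs would form the supervised fine-tuning set, and rejected passages would simply be dropped.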

Paper Structure (25 sections, 1 equation, 8 figures, 5 tables)

Figures (8)

  • Figure 1: We propose a Dramaturg Spatializer that transforms long-form narrative into structured stage layouts. The model assigns dialogue and characters to canonical positions on a discrete grid, defined by depth (front, middle, back) and lateral placement (left, center, right), and adds temporal blocking to produce plausible staging from text. This automates a "text-to-play" process for applications such as game previsualization and spatial audiobooks.
  • Figure 2: Overview of our Spatializer training pipeline. Left: A strong teacher LLM produces pseudo stage-play annotations via Best-of-$N$ sampling and rejection SFT under our deterministic dramaturgy evaluator. Right: The resulting Spatializer-SFT is post-trained with RL from verifiable rewards (GRPO): for each passage, we sample candidate spatializations, score them with the deterministic evaluator, compute group-relative advantages (with KL regularization to a frozen reference model), and update the policy to obtain Spatializer-GRPO. The model outputs a reasoning trace (e.g., per-quote placement plan) and the corresponding stage-grid layout.
  • Figure 3: Additional analyses. (a) Distribution of per-example AVG scores over GRPO, SFT, and Baseline models. RL training shifts mass and reduces degenerate failures, indicating improved reliability beyond mean score gains. (b) Deterministic macro score (%) versus GRPO steps when swapping the backbone (Qwen3 vs. LLaMA 3.3); the horizontal line denotes the best proprietary API baseline (83.45%). (c) ROC curve for a logistic regression that predicts whether listeners prefer audio $A$ over $B$ from the deterministic dramaturgy score difference $\Delta s = s(A)-s(B)$. The classifier achieves $\mathrm{AUC}=0.701$ with Brier score $0.186$.
  • Figure 4: Challenging qualitative cases for spatialization diagnostics. Left: case probing consistency between generated reasoning and emitted layout. Right: case probing continuity when new characters enter an already-populated scene.
  • Figure B.1: Extra results: test-time selection and training-data scaling. (a) Best-of-$N$ at inference time for Qwen3-8B: we sample $N$ candidate spatializations per passage and select the highest-scoring one using three ranking signals: Same Model (self-judging with the LLM-as-a-judge prompt), Bigger Model (GPT-4.1 as judge with the same prompt), and our deterministic dramaturgy evaluator (Oracle). Increasing $N$ improves the selected output for all scorers, with the largest gains under the deterministic oracle. (b) Data scaling for Spatializer-SFT: we fine-tune the same base model with increasing numbers of supervised training examples and evaluate the resulting model on the same test set. Performance rises monotonically with dataset size, with diminishing marginal gains at larger scales.
  • ...and 3 more figures
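
The "group-relative advantages" mentioned in the Figure 2 caption can be sketched as follows. This assumes the commonly used GRPO normalization, in which each candidate's reward is centered on its group's mean and divided by the group's standard deviation. The captions above do not give the paper's exact formula, so the normalization, the `eps` constant, and the example reward values are all assumptions; the KL regularization term is omitted.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of candidates sampled for a passage."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps (an assumed constant) avoids division by zero when all rewards tie.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical evaluator scores for four candidate layouts of one passage.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])

# A group whose candidates all score the same gives zero advantage everywhere.
tied = group_relative_advantages([0.5, 0.5, 0.5])
```

Under this normalization, candidates that beat their group's mean receive positive advantages and the rest negative ones, so the policy update depends on relative quality within the group rather than on the absolute score.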