DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation

Zun Wang, Jialu Li, Han Lin, Jaehong Yoon, Mohit Bansal

TL;DR

DreamRunner tackles the challenge of fine-grained, multi-entity storytelling video generation by integrating dual-level LLM planning, retrieval-augmented motion and subject priors learned via test-time adaptation, and a spatial-temporal region-based diffusion module (SR3AI) with region-specific attention and LoRA injection. The approach yields state-of-the-art results in SVG and compositional T2V tasks, improving character consistency, text alignment, and smooth transitions while generalizing to multi-character scenarios. Key contributions include a retrieval-augmented prior learning pipeline, per-video prompts for motion priors, and a region-conditioned diffusion mechanism that tightly binds objects to their actions across frames. These innovations enable more faithful, controllable, and scalable story-to-video generation with practical implications for media, storytelling, and interactive AI systems.

Abstract

Storytelling video generation (SVG) aims to produce coherent and visually rich multi-scene videos that follow a structured narrative. Existing methods primarily employ LLMs for high-level planning to decompose a story into scene-level descriptions, which are then generated independently and stitched together. However, these approaches struggle to generate high-quality videos aligned with complex single-scene descriptions, because visualizing such descriptions involves coherent composition of multiple characters and events, complex motion synthesis, and multi-character customization. To address these challenges, we propose DreamRunner, a novel story-to-video generation method. First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout planning. Next, DreamRunner uses retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos and thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module, SR3AI, for fine-grained object-motion binding and frame-by-frame spatial-temporal semantic control. We compare DreamRunner with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DreamRunner exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DreamRunner's robust ability to generate multi-object interactions with qualitative examples.

Paper Structure

This paper contains 33 sections, 8 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Overall pipeline for DreamRunner. (1) Plan generation stage: we employ an LLM to craft a hierarchical video plan (i.e., "High-Level Plan" and "Fine-Grained Plan") from a user-provided generic story narration. (2.1) Motion retrieval and prior learning stage: we retrieve videos relevant to the desired motions from a video database for learning the motion prior through test-time fine-tuning. (2.2) Subject prior learning stage: we use reference images for learning the subject prior through test-time fine-tuning. (3) Video generation with region-based diffusion stage: we equip the diffusion model with a novel spatial-temporal region-based 3D attention and prior injection module (i.e., SR3AI) for video generation with fine-grained control.
  • Figure 2: Implementation details for region-based diffusion. We extend the vanilla self-attention mechanism to spatial-temporal region-based 3D attention (see upper orange part), which is capable of aligning different regions with their respective text descriptions via region-specific masks. The region-based character and motion LoRAs (see lower yellow and blue parts) are then injected in an interleaved manner into the attention and FFN layers in each transformer block (see the right part). Note that although we reshape the visual latents into sequential 2D latent frames for better visualization, they are flattened and concatenated with all conditions when performing region-based attention. Fig. 3 and the Appendix provide an example of the region-based attention mask and more details of region-based LoRA injection, respectively.
  • Figure 3: Visualization of spatial-temporal region-based 3D attention mask. Different text colors represent different conditions, while the white region indicates masked areas. For simplicity, we reduce each condition to two words, each frame to three segments, and display only three conditions and two frames in the figure. In practice, conditions can be longer and more numerous, frames can have more segments, and there are 12 latent frames in total.
  • Figure 4: Qualitative comparison and ablations of DreamRunner on SVG. In the (a) multi-character example, DreamRunner produces significantly better character consistency than other strong baselines, which either fail to maintain object consistency (e.g., VLogger) or fail to generate multiple objects ((a) Rows 2 and 4). In the (b) single-character setting, integrating SR3AI and locally injected priors consistently improves overall quality, complex motion synthesis, and coherent composition. Note that in the overlapped regions in (b) Row 3, the caption is a merge of the two; for cleaner visualization, we do not show it here.
  • Figure 5: Generated multi-character videos with slightly overlapping regions.
  • ...and 11 more figures
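
To make the region-based attention mask described in the Figure 2 and Figure 3 captions more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It uses the simplified setting from the Figure 3 caption (three conditions of two words each, two frames of three segments each). The function name `build_region_mask`, the token ordering (all text tokens first, then visual tokens frame by frame), the example regions, and the masking rules are assumptions made for illustration: a text token and a visual token may attend to each other only if the visual segment lies inside that condition's region in that frame, text tokens of different conditions are masked from each other, and visual tokens attend to all visual tokens.

```python
from typing import List, Set


def build_region_mask(
    cond_lens: List[int],
    regions: List[List[Set[int]]],
    num_frames: int,
    segs_per_frame: int,
) -> List[List[bool]]:
    """Return allowed[i][j]: True if token i may attend to token j.

    cond_lens[c]   : number of text tokens in condition c.
    regions[c][f]  : segment indices of frame f covered by condition c.
    Token order (an assumption): text tokens condition by condition,
    then visual tokens frame by frame, segment by segment.
    """
    # Which condition owns each text token.
    text_owner = [c for c, n in enumerate(cond_lens) for _ in range(n)]
    num_text = len(text_owner)
    # (frame, segment) for each visual token.
    visual = [(f, s) for f in range(num_frames) for s in range(segs_per_frame)]
    total = num_text + len(visual)

    allowed = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            i_text, j_text = i < num_text, j < num_text
            if i_text and j_text:
                # Text-text: only within the same condition.
                allowed[i][j] = text_owner[i] == text_owner[j]
            elif not i_text and not j_text:
                # Visual-visual: left unmasked in this sketch.
                allowed[i][j] = True
            else:
                # Text-visual (either direction): only inside the region.
                t, v = (i, j) if i_text else (j, i)
                frame, seg = visual[v - num_text]
                allowed[i][j] = seg in regions[text_owner[t]][frame]
    return allowed


# Hypothetical layout: condition 0 grows from segment 0 to segments 0-1,
# condition 1 stays on segment 2, condition 2 covers the whole frame.
example_regions = [
    [{0}, {0, 1}],
    [{2}, {2}],
    [{0, 1, 2}, {0, 1, 2}],
]
mask = build_region_mask([2, 2, 2], example_regions, num_frames=2, segs_per_frame=3)

# Render the mask: '#' = attention allowed, '.' = masked.
for row in mask:
    print("".join("#" if ok else "." for ok in row))
```

In a real model the boolean matrix would be converted into an additive or boolean attention mask over the flattened sequence of conditions and visual latents; with 12 latent frames and longer captions the same construction applies, only with more tokens.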