DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation

Zun Wang; Jialu Li; Han Lin; Jaehong Yoon; Mohit Bansal

TL;DR

DreamRunner tackles fine-grained, multi-entity storytelling video generation (SVG) by combining dual-level LLM planning, retrieval-augmented motion and subject priors learned via test-time adaptation, and a spatial-temporal region-based 3D attention and prior injection module (SR3AI) with region-specific attention and LoRA injection. The approach achieves state-of-the-art results on SVG and compositional text-to-video (T2V) tasks, improving character consistency, text alignment, and smooth transitions while generalizing to multi-character scenarios. Key contributions include a retrieval-augmented prior learning pipeline, per-video prompts for motion priors, and a region-conditioned diffusion mechanism that tightly binds objects to their actions across frames. These innovations enable more faithful, controllable, and scalable story-to-video generation with practical implications for media, storytelling, and interactive AI systems.

Abstract

Storytelling video generation (SVG) aims to produce coherent and visually rich multi-scene videos that follow a structured narrative. Existing methods primarily employ LLMs for high-level planning to decompose a story into scene-level descriptions, which are then independently generated and stitched together. However, these approaches struggle to generate high-quality videos aligned with complex single-scene descriptions, as visualizing such descriptions involves coherent composition of multiple characters and events, complex motion synthesis, and multi-character customization. To address these challenges, we propose DREAMRUNNER, a novel story-to-video generation method. First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout planning. Next, DREAMRUNNER presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos and thus facilitating the generation of new videos with complex, scripted motions. Lastly, we propose a novel spatial-temporal region-based 3D attention and prior injection module, SR3AI, for fine-grained object-motion binding and frame-by-frame spatial-temporal semantic control. We compare DREAMRUNNER with various SVG baselines, demonstrating state-of-the-art performance in character consistency, text alignment, and smooth transitions. Additionally, DREAMRUNNER exhibits strong fine-grained condition-following ability in compositional text-to-video generation, significantly outperforming baselines on T2V-ComBench. Finally, we validate DREAMRUNNER's robust ability to generate multi-object interactions with qualitative examples.
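The region-based binding described in the abstract can be illustrated with a minimal sketch. It is not the authors' SR3AI implementation: the `Region` type, the `build_region_mask` function, and the grid sizes are hypothetical names and values chosen for illustration. The sketch only shows how a mask could restrict each video latent token to the region-specific conditions (for example, a character's description and motion) whose spatio-temporal box contains it.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical spatio-temporal box with half-open ranges over frames, rows, columns."""
    t0: int
    t1: int
    y0: int
    y1: int
    x0: int
    x1: int


def build_region_mask(num_frames: int, height: int, width: int,
                      regions: List[Region]) -> List[List[bool]]:
    """Return mask[token][region], True where the latent token lies inside the region.

    Tokens are flattened in (frame, row, column) order. This is an assumed
    layout for illustration, not one taken from the paper.
    """
    mask = []
    for t in range(num_frames):
        for y in range(height):
            for x in range(width):
                mask.append([
                    r.t0 <= t < r.t1 and r.y0 <= y < r.y1 and r.x0 <= x < r.x1
                    for r in regions
                ])
    return mask


# Two illustrative regions on a 2-frame, 2x4 latent grid:
# one character on the left half in both frames,
# another on the right half in the second frame only.
regions = [Region(0, 2, 0, 2, 0, 2), Region(1, 2, 0, 2, 2, 4)]
mask = build_region_mask(2, 2, 4, regions)
```

In an attention layer, a mask of this kind would typically be used to block a visual token from attending to the text or prior tokens of regions it does not belong to; how DREAMRUNNER actually applies its masks and LoRA injection is specified in the paper, not here.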
