
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie

TL;DR

OneStory tackles long-range narrative coherence in multi-shot video generation (MSV) by reframing it as next-shot autoregressive generation conditioned on prior shots and referential captions. It introduces a Frame Selection module that builds a semantically relevant global memory from prior shots and an Adaptive Conditioner that compresses this memory by content importance, yielding a global yet compact cross-shot context. A curated 60K multi-shot dataset with shot-level referential captions supports training, and a pretrained image-to-video (I2V) model finetuned on this data achieves state-of-the-art narrative coherence in both text- and image-conditioned settings. The approach preserves character identity and scene consistency across shots while following the narrative progression, which suggests practical value for long-form storytelling.

Abstract

Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
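
The next-shot formulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Frame` class, the toy embeddings, the cosine-similarity scoring, the `embed_caption` and `generate_shot` callables, and the choice of `k` are all assumptions made for illustration. The sketch only shows the control flow: keep a memory of frames from earlier shots, select the frames most relevant to the current caption, and condition the next shot on them.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Frame:
    shot_index: int
    frame_index: int
    embedding: List[float]  # toy feature vector (assumption, not the paper's features)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(memory: List[Frame], query: List[float], k: int) -> List[Frame]:
    """Return up to k memory frames most similar to the query, best first.

    Similarity-based ranking is an assumed stand-in for the Frame Selection module.
    """
    ranked = sorted(memory, key=lambda f: cosine(f.embedding, query), reverse=True)
    return ranked[:k]


def generate_story(
    captions: List[str],
    embed_caption: Callable[[str], List[float]],
    generate_shot: Callable[[int, str, List[Frame]], List[Frame]],
    k: int = 2,
) -> List[List[Frame]]:
    """Generate shots one at a time, each conditioned on frames selected from memory."""
    memory: List[Frame] = []
    shots: List[List[Frame]] = []
    for index, caption in enumerate(captions):
        context = select_frames(memory, embed_caption(caption), k)
        frames = generate_shot(index, caption, context)  # hypothetical generator call
        shots.append(frames)
        memory.extend(frames)  # every generated frame becomes available to later shots
    return shots
```

Because selection ranks all frames in memory rather than only the most recent ones, a frame from the first shot can still condition the tenth shot if it is the most relevant, which is the sense in which the context is global.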

Paper Structure

This paper contains 19 sections, 12 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Coherent multi-shot generations with OneStory. Each example shows 10 shots from a minute-long video. OneStory handles both image-to-multi-shot (top) and text-to-multi-shot (middle) generation within the same model, and generalizes well to out-of-domain scenes (bottom). It maintains consistent characters and environments while faithfully following complex and evolving prompts to produce coherent long-form narratives. A representative segment of each prompt is given with the corresponding shot. We recommend viewing the project page at https://zhaochongan.github.io/projects/OneStory for better visualization.
  • Figure 2: Multi-shot video data curation pipeline. From raw videos, we obtain high-quality multi-shot sequences via three steps: (i) Shot detection, (ii) Two-stage captioning, and (iii) Quality filtering. In the second step, each shot is first captioned independently and then rewritten into referential form based on preceding shots. Unlike prior datasets, no global captions are used; only shot-level captions with progressive narrative flow are retained, which keeps the data flexible while reflecting real-world storytelling.
  • Figure 3: Overview of the proposed OneStory. Our model reframes multi-shot video generation (MSV) as a next-shot generation task. (a) During training, the model learns to generate the final shot conditioned on the preceding two; when only two shots are available, we inflate with a synthetic shot to enable unified three-shot training. (b) At inference, it maintains a memory bank of past shots and generates multi-shot videos autoregressively. The model comprises two key components: (c) a Frame Selection module that selects semantically relevant frames from preceding shots to construct a global context, and (d) an Adaptive Conditioner that dynamically compresses the selected context and injects it directly into the generator for efficient conditioning. Together, OneStory realizes adaptive memory modeling, enabling global yet compact cross-shot context for coherent narrative generation.
  • Figure 4: Patchification Comparison. Left: Prior fixed temporal schemes typically consider the most recent block of contiguous frames and assign patchifiers by temporal order (e.g., the finest patchifier for the latest frame). Right: Our adaptive scheme selects non-contiguous frames and allocates patchifiers based on content importance (i.e., the finest patchifier for the most important frame).
  • Figure 5: Qualitative results. For a fair comparison, the given multi-shot generations share the same first shot (generated by Wan2.2) as the initial condition, except for StoryDiff.+Wan2.1, which does not rely on visual conditioning. The baseline methods fail to maintain narrative consistency across shots, struggling with prompt adherence, reappearance, and compositional scenes, whereas OneStory (ours) faithfully follows shot-level captions and produces coherent shots. A representative segment of each prompt is given with the corresponding shot.
  • ...and 2 more figures
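
The contrast drawn in the Figure 4 caption, between assigning patchifiers by temporal order and assigning them by content importance, can be sketched in a few lines of Python. This is an illustrative sketch only: the patch sizes, the importance scores, the 32x32 latent resolution, and the rule of one frame per patchifier level (with remaining frames sharing the coarsest) are assumptions, not values from the paper.

```python
from typing import Dict, List


def allocate_patch_sizes(scores: List[float], patch_sizes: List[int]) -> Dict[int, int]:
    """Map frame index -> patch size.

    patch_sizes is ordered finest first. The highest-scoring frame gets the
    finest patchifier; frames beyond the available levels share the coarsest.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    allocation: Dict[int, int] = {}
    for rank, frame_index in enumerate(order):
        level = min(rank, len(patch_sizes) - 1)
        allocation[frame_index] = patch_sizes[level]
    return allocation


def temporal_allocation(num_frames: int, patch_sizes: List[int]) -> Dict[int, int]:
    """Fixed temporal scheme: recency is used as the score, so the latest frame is finest."""
    recency = [float(i) for i in range(num_frames)]
    return allocate_patch_sizes(recency, patch_sizes)


def token_count(height: int, width: int, patch: int) -> int:
    """Number of tokens produced by a square patchifier on a height x width grid."""
    return (height // patch) * (width // patch)


if __name__ == "__main__":
    importance = [0.2, 0.9, 0.5, 0.1]  # hypothetical per-frame importance scores
    sizes = [2, 4, 8]                  # hypothetical patch sizes, finest first
    adaptive = allocate_patch_sizes(importance, sizes)
    fixed = temporal_allocation(len(importance), sizes)
    print("adaptive:", adaptive)
    print("temporal:", fixed)
    print("context tokens:", sum(token_count(32, 32, p) for p in adaptive.values()))
```

Both schemes spend the same token budget here; they differ only in which frame receives the detailed representation. Under the adaptive rule the fine patchifier follows the importance score rather than the frame's position in time.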