StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives

Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang

TL;DR

StoryTailor is a zero-shot pipeline that runs on a single RTX 4090 (24 GB). Given a long narrative prompt, per-subject references, and grounding boxes, it produces temporally coherent, identity-preserving image sequences with expressive interactions and evolving yet stable scenes.

Abstract

Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA), which dynamically focuses on each subject core and eases grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR), which amplifies action-related directions in the text embedding space; and Selective Forgetting Cache (SFC), which retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. In experiments, CLIP-T improves by up to 10-15% over baseline methods, DreamSim is lower than that of strong baselines, and CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.
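
The abstract names AB-SVR without stating its formula. As a rough sketch of what singular value reweighting of a text embedding could look like (an assumption for illustration, not the paper's definition), let $E \in \mathbb{R}^{n \times d}$ be the token embedding matrix of the prompt, $a_i \in [0, 1]$ a hypothetical score of how strongly the $i$-th singular direction aligns with the action tokens, and $\alpha \ge 0$ a hypothetical boost strength:

$$
E = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_r)\, V^\top, \qquad
\tilde{\sigma}_i = (1 + \alpha\, a_i)\, \sigma_i, \qquad
\tilde{E} = U \,\mathrm{diag}(\tilde{\sigma}_1, \dots, \tilde{\sigma}_r)\, V^\top
$$

Under these assumptions, $\alpha = 0$ recovers $E$ exactly, and larger $\alpha$ scales up only the directions scored as action-related while leaving the singular vectors unchanged.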

Paper Structure (27 sections, 13 equations, 18 figures, 7 tables)

Figures (18)

  • Figure 1: Zero-shot on a single 24-GB GPU, our pipeline produces action-rich multi-subject narratives with consistent identities and smooth scene transitions. Right: baselines show background drag, weak interaction, and close-range conflicts; our GCA stabilizes subjects, AB-SVR boosts actions, and SFC retains transferable background cues.
  • Figure 2: GCA with SFC: the first 77 tokens run the main cross-attention with a subject mask; the remaining tokens form an IP branch, where overlaps yield a dynamic Gaussian-decay mask applied to Q and K. Cross-frame K/V caches recall layout while limiting background carry-over, and the IP output is scaled and fused with the main state.
  • Figure 3: Single-frame image consistency on single- and multi-subject tasks. Top: single-subject action sequences. Bottom: multi-subject interaction and attribute binding (dog and cat rubbing each other, teddy bear with a backpack). Only Ours and Qwen-Edit attach the backpack.
  • Figure 4: Prompt-following for actions and interactions in single- and multi-subject cases. With the same references and prompts, our method preserves identity and binds attributes to the correct subject while rendering actions and interactions—standing in fog, running in a forest, jumping on a beach; dancing and holding hands, play-fighting, hugging, nestling—with cleaner backgrounds than $\lambda$-Eclipse, MS-Diffusion, FluxKontext, Qwen-Edit, Nano-Banana, and MS+1P1S.
  • Figure 5: Additional single-frame image consistency visual experiments.
  • ...and 13 more figures
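
The Figure 2 caption mentions a dynamic Gaussian-decay mask that arises where grounding boxes overlap. The sketch below shows one way such spatial weights might be built; it is an illustration under stated assumptions, not the paper's implementation. The names `inside`, `gaussian_weight`, `subject_masks`, and the `sigma_scale` parameter are hypothetical, boxes are assumed to be non-empty `(x0, y0, x1, y1)` tuples with exclusive upper bounds, and how the weights enter Q and K is not covered here.

```python
import math

def inside(box, x, y):
    """True if cell (x, y) lies in box = (x0, y0, x1, y1), upper bounds exclusive."""
    x0, y0, x1, y1 = box
    return x0 <= x < x1 and y0 <= y < y1

def gaussian_weight(box, x, y, sigma_scale=0.5):
    """Gaussian decay from the box centre (assumed form).

    sigma is sigma_scale times the half-extent of the box along each axis.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1 - 1) / 2.0, (y0 + y1 - 1) / 2.0
    sx = sigma_scale * (x1 - x0) / 2.0
    sy = sigma_scale * (y1 - y0) / 2.0
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))

def subject_masks(boxes, height, width, sigma_scale=0.5):
    """One height x width mask per subject.

    Cells owned by a single box get weight 1.0; cells shared with another
    box get a Gaussian weight that decays away from this subject's centre.
    Cells outside the box stay at 0.0.
    """
    masks = []
    for i, box in enumerate(boxes):
        others = boxes[:i] + boxes[i + 1:]
        mask = [[0.0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                if not inside(box, x, y):
                    continue
                contested = any(inside(o, x, y) for o in others)
                mask[y][x] = gaussian_weight(box, x, y, sigma_scale) if contested else 1.0
        masks.append(mask)
    return masks

# Two subjects whose boxes share columns 3 and 4 of an 8 x 8 grid.
boxes = [(0, 0, 5, 8), (3, 0, 8, 8)]
mask_a, mask_b = subject_masks(boxes, height=8, width=8)
```

In the shared columns, the subject whose centre is nearer receives the larger weight, so each contested cell leans towards one subject instead of being claimed equally by both.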