SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls

Qianxun Xu, Chenxi Song, Yujun Cai, Chi Zhang

TL;DR

This work presents SwitchCraft, a training-free framework for multi-event video generation. It introduces Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with the relevant event prompts, and Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity.

Abstract

Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. Without explicit temporal grounding, such models often produce blended or collapsed scenes when given multi-event prompts, breaking the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.

Paper Structure

This paper contains 20 sections, 23 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: SwitchCraft enables flexible multi-event video generation across multiple actions and scenes with smooth transitions. It steers attention to enhance prompt alignment and maintain coherent temporal evolution while preserving global information.
  • Figure 2: Overview of SwitchCraft. (a) EAQS takes a text prompt and user-specified event time spans, identifies anchor tokens for each event, and constructs event-specific projectors from their attention keys. It steers video queries toward the target event and away from the others in each temporal span. (b) ABSS estimates enhancement and suppression strengths by extracting dominant directions from the event keys and correcting the attention deficit. The updated queries pass through the video diffusion transformer so that each temporal span follows its intended event with smooth transitions.
  • Figure 3: Qualitative comparison. SwitchCraft executes all events in the intended order, prevents leakage and omission, and maintains subject and scene consistency. Baselines show omissions, drift across spans, or progressive quality decay.
  • Figure 4: Failure of Stitch. Segment-wise generation inherits bias from the last frame of the previous segment and propagates motion and layout, causing action bleed and unstable handoffs.
  • Figure 5: More applications. SwitchCraft generates creative occluding transitions between scenes, preventing event bleeding while preserving identity and global context.
  • ...and 6 more figures
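
The query steering described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the update rule `q + alpha * P_target(q) - beta * P_other(q)`, the Gram-Schmidt construction of the projectors, the half-open frame spans, and all function and parameter names (`steer_query`, `event_for_frame`, `alpha`, `beta`) are assumptions made for illustration. In the paper the two strengths are estimated by ABSS; here they are simply passed in as fixed numbers.

```python
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(keys: Sequence[Sequence[float]], eps: float = 1e-9) -> List[Vector]:
    """Gram-Schmidt over the anchor-token key vectors of one event."""
    basis: List[Vector] = []
    for key in keys:
        v = list(key)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip keys already spanned by the basis
            basis.append([vi / norm for vi in v])
    return basis


def project(q: Sequence[float], basis: Sequence[Sequence[float]]) -> Vector:
    """Project q onto the subspace spanned by an orthonormal basis."""
    out = [0.0] * len(q)
    for b in basis:
        c = dot(q, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def event_for_frame(frame: int, spans: Sequence[Tuple[int, int]]) -> Optional[int]:
    """Return the index of the event whose half-open span [start, end) holds the frame."""
    for event, (start, end) in enumerate(spans):
        if start <= frame < end:
            return event
    return None


def steer_query(
    q: Sequence[float],
    target_keys: Sequence[Sequence[float]],
    other_keys: Sequence[Sequence[float]],
    alpha: float,
    beta: float,
) -> Vector:
    """Assumed update: push q toward the target event's key subspace
    (strength alpha) and away from the other events' subspace (strength beta)."""
    toward = project(q, orthonormal_basis(target_keys))
    away = project(q, orthonormal_basis(other_keys))
    return [qi + alpha * t - beta * a for qi, t, a in zip(q, toward, away)]
```

In this sketch, a frame's query would first be assigned an event with `event_for_frame`, and `steer_query` would then be called with that event's anchor keys as `target_keys` and the remaining events' anchor keys as `other_keys`.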