EchoShot: Multi-Shot Portrait Video Generation

Jiahao Wang, Hualian Sheng, Sijia Cai, Weizhan Zhang, Caixia Yan, Yachuang Feng, Bing Deng, Jieping Ye

TL;DR

EchoShot tackles multi-shot portrait video generation by enabling native, identity-consistent generation across multiple shots with flexible per-shot prompts. It introduces two shot-aware rotary position embeddings: TcRoPE, which models inter-shot boundaries, and TaRoPE, which aligns each shot with its caption. The model is trained on the new PortraitGala dataset. The framework also supports personalized (PMT2V) and infinite-shot (InfT2V) video generation through an additional conditioning branch and the RefAttn mechanism. Empirical results show better identity preservation, controllability, and visual quality than the baselines, suggesting that EchoShot could serve as a foundational approach for multi-shot video modeling.

Abstract

Video diffusion models substantially boost the productivity of artistic workflows through their capacity to generate high-quality portrait videos. However, prevailing pipelines are largely limited to single-shot creation, while real-world applications call for multiple shots with identity consistency and flexible content controllability. In this work, we propose EchoShot, a native and scalable multi-shot framework for portrait customization built upon a foundation video diffusion model. First, we propose shot-aware position embedding mechanisms within the video diffusion transformer architecture to model inter-shot variations and establish the correspondence between multi-shot visual content and its textual descriptions. This simple yet effective design enables direct training on multi-shot video data without introducing additional computational overhead. To facilitate model training in the multi-shot scenario, we construct PortraitGala, a large-scale and high-fidelity human-centric video dataset featuring cross-shot identity consistency and fine-grained captions covering facial attributes, outfits, and dynamic motions. To further enhance applicability, we extend EchoShot to perform reference image-based personalized multi-shot generation and long video synthesis with infinite shot counts. Extensive evaluations demonstrate that EchoShot achieves superior identity consistency as well as attribute-level controllability in multi-shot portrait video generation. Notably, the proposed framework demonstrates potential as a foundational paradigm for general multi-shot video modeling.

Paper Structure

This paper contains 27 sections, 19 equations, 14 figures, and 5 tables.

Figures (14)

  • Figure 1: Given multiple formatted prompts of the same character, EchoShot generates multi-shot portrait videos showing the same appearance with superior fine-grained controllability.
  • Figure 2: (a) The overall architecture of EchoShot, a multi-shot video generation paradigm, which features two intricate RoPE mechanisms. (b) TcRoPE, a 3D-RoPE which applies an extra angular rotation at every inter-shot boundary along the time dimension. (c) TaRoPE, a 1D-RoPE which differentiates between matching and non-matching shot-caption pairs. Note that the visualization displays only one rotational component, with others excluded for simplicity.
  • Figure 3: Two enhanced pipelines based on the MT2V model. (a) The PMT2V pipeline, with an integrated conditioner branch, generates multi-shot portrait videos of a given face input. (b) The InfT2V pipeline creates infinite shots of the same person across multiple generation attempts, enabled by RefAttn, which disentangles the first shot as a constant reference.
  • Figure 4: (a) An example caption from PortraitGala. Each clip is thoroughly captioned in the fine-grained format. (b) The word cloud reflects the comprehensiveness of the captions. (c) PortraitGala consists of 650,000 clips with 400,000 IDs, totaling a video duration of 1,000 hours.
  • Figure 5: Illustration of EchoShot and baselines in the MT2V task. Key prompts are marked blue. Our method demonstrates superior appearance consistency and fine-grained controllability.
  • ...and 9 more figures
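
The Figure 2 caption describes TcRoPE as adding an extra angular rotation at every inter-shot boundary along the time dimension. The following is a minimal sketch of that idea on the time axis only, not the paper's implementation. It assumes the extra rotation can be modeled as a fixed jump in the temporal position index at each boundary; the function names, the `boundary_offset` parameter, and the frequency base of 10000 are all hypothetical choices made for illustration, and the spatial axes of the 3D-RoPE are left out.

```python
import math


def shot_aware_time_positions(shot_lengths, boundary_offset):
    """Return one temporal position per frame.

    Assumption (not from the paper): the extra rotation at an inter-shot
    boundary is modeled as an additional jump of `boundary_offset` in the
    position index before the first frame of every shot after the first.
    """
    positions = []
    pos = 0.0
    for shot_idx, num_frames in enumerate(shot_lengths):
        if shot_idx > 0:
            pos += boundary_offset  # extra jump at the shot boundary
        for _ in range(num_frames):
            positions.append(pos)
            pos += 1.0
    return positions


def rope_angles(position, dim, base=10000.0):
    """Standard 1D RoPE angles for one position: position * base^(-2i/dim)."""
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, angles):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


# Two shots of 3 and 2 frames, with a hypothetical boundary offset of 4.
positions = shot_aware_time_positions([3, 2], boundary_offset=4.0)
query = [1.0, 0.0, 0.5, -0.5]
rotated = [apply_rope(query, rope_angles(p, dim=4)) for p in positions]
```

In this sketch, frames within a shot are one position apart, while the last frame of one shot and the first frame of the next are `1 + boundary_offset` positions apart, so attention can distinguish a shot cut from ordinary frame-to-frame motion. Because only the position indices change, no parameters or extra attention operations are added, which matches the abstract's claim of no additional computational overhead.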