Concat-ID: Towards Universal Identity-Preserving Video Synthesis

Yong Zhong, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Chongxuan Li

TL;DR

Concat-ID introduces a unified, tuning-free framework for identity-preserving video synthesis: VAE-extracted image latents are concatenated with video latents along the sequence dimension and integrated through the model's existing 3D self-attention. A cross-video pairing strategy and a three-stage training regimen balance identity fidelity against facial editability and video naturalness, so the method scales from single-identity to multi-identity and multi-subject generation without extra modules. Experiments on ConsisID-Benchmark show state-of-the-art identity consistency and editability, corroborated by user studies; the paper also demonstrates virtual try-on and background-controllable generation. The approach scales with minimal architectural changes, though body-structure fidelity under complex motions remains an open challenge.

Abstract

We present Concat-ID, a unified framework for identity-preserving video generation. Concat-ID employs variational autoencoders to extract image features, which are then concatenated with video latents along the sequence dimension. It relies exclusively on inherent 3D self-attention mechanisms to incorporate them, eliminating the need for additional parameters or modules. A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance identity consistency and facial editability while enhancing video naturalness. Extensive experiments demonstrate Concat-ID's superiority over existing methods in both single and multi-identity generation, as well as its seamless scalability to multi-subject scenarios, including virtual try-on and background-controllable generation. Concat-ID establishes a new benchmark for identity-preserving video synthesis, providing a versatile and scalable solution for a wide range of applications.
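
The concatenation step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the video and reference-image latents are already flattened into token vectors of equal width, uses a single attention head with identity query/key/value projections, and assumes only the video positions are kept after attention. The names `self_attention` and `concat_id_step` are hypothetical, and plain Python lists stand in for tensors to keep the example dependency-free.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections.

    Assumption: a real model uses learned projections and many heads;
    this only shows that every token attends to every other token.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def concat_id_step(video_tokens, image_tokens):
    """Append reference-image tokens to the end of the video token sequence,
    run self-attention over the joint sequence, and keep the video positions.

    Assumption: dropping the image positions afterwards is a choice made for
    this sketch, not something stated in the abstract.
    """
    sequence = video_tokens + image_tokens  # concatenate along the sequence axis
    attended = self_attention(sequence)
    return attended[: len(video_tokens)]


random.seed(0)
dim = 4
video_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
image_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

out = concat_id_step(video_tokens, image_tokens)
print(len(out), len(out[0]))  # 6 4
```

Because the reference tokens sit in the same sequence as the video tokens, the existing attention layer mixes identity information into the video positions without any new parameters. Under the same assumptions, further reference images (multiple identities or subjects) would simply add more tokens to `image_tokens`.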

Paper Structure

This paper contains 22 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The architecture of Concat-ID. We utilize VAEs to extract image latents from reference images and concatenate them at the end of the video latents along the sequence dimension. Concat-ID relies solely on 3D self-attention mechanisms, which are commonly present in state-of-the-art video generation models, to integrate image features without adding extra modules or parameters.
  • Figure 2: Constructing three types of image-video pairs for a single identity: pre-training, cross-video and trade-off pairs.
  • Figure 3: Qualitative comparisons for single-identity generation. ID-Animator fails to preserve facial details, while ConsisID replicates the expressions of the reference images, particularly in the third case, where the semantic gap between the text and the reference image is significant. Concat-ID preserves identity without directly replicating the facial expressions of the reference images.
  • Figure 4: Qualitative comparisons for multi-identity generation. Concat-ID better maintains different identities.
  • Figure 5: Human evaluation. Concat-ID produces more precise and natural videos while effectively preserving identity.
  • ...and 1 more figure