Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng

TL;DR

The paper introduces Spatially-Conditioned Diffusion (SCD), which achieves consistent human image and video generation by treating reference-based synthesis as spatial inpainting within a single denoising network. Spatial conditioning keeps reference and target features in the same feature space, and a causal feature interaction lets reference features attend only to themselves while target features attend to both reference and target features, which preserves fine-grained appearance details. For efficiency, the method is split into two stages, reference appearance extraction and conditioned target generation, both executed by the same network. Experiments on image and video datasets show competitive or superior performance against state-of-the-art approaches such as DisCo, DreamPose, and Animate-related methods, and ablations confirm the value of the causal interaction and reference-pose injection. The results indicate strong generalization to unseen identities and poses without per-instance fine-tuning, and the approach also applies to tasks such as visual try-on and face reenactment.
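The causal feature interaction summarized above can be illustrated with a toy attention mask. The following is a minimal sketch, not the authors' implementation: it assumes reference and target tokens are concatenated in the order [reference, target] and passed through a single self-attention layer, and the names `causal_interaction_mask` and `masked_self_attention`, the token counts, and the random weights are all hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis; exp(-inf) evaluates to 0.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_interaction_mask(n_ref, n_tgt):
    """Boolean mask: entry [i, j] is True if query token i may attend to key token j.

    Assumed token order: [reference tokens, target tokens].
    """
    n = n_ref + n_tgt
    mask = np.ones((n, n), dtype=bool)
    mask[:n_ref, n_ref:] = False  # reference queries never see target keys
    return mask

def masked_self_attention(x, wq, wk, wv, mask):
    # Single-head scaled dot-product attention with disallowed pairs set to -inf.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

# Toy data (all sizes and weights are illustrative assumptions).
rng = np.random.default_rng(0)
d, n_ref, n_tgt = 8, 4, 6
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
ref = rng.standard_normal((n_ref, d))
tgt_a = rng.standard_normal((n_tgt, d))
tgt_b = rng.standard_normal((n_tgt, d))

mask = causal_interaction_mask(n_ref, n_tgt)
out_a = masked_self_attention(np.concatenate([ref, tgt_a]), wq, wk, wv, mask)
out_b = masked_self_attention(np.concatenate([ref, tgt_b]), wq, wk, wv, mask)

# Reference outputs should not depend on which target tokens are present.
ref_unchanged = np.allclose(out_a[:n_ref], out_b[:n_ref])
```

Under this mask the reference rows are computed from reference tokens alone, so swapping the target tokens leaves them unchanged, while the target rows still read from both groups.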

Abstract

Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework, in which reference features can only query from themselves, while target features can query appearance information from both the reference and the target. To further enhance computational efficiency and flexibility, in practical implementation, we decompose the spatially-conditioned generation process into two stages: reference appearance extraction and conditioned target generation. Both stages share a single denoising network, with interactions restricted to self-attention layers. This proposed method ensures flexible control over the appearance of generated human images and videos. By fine-tuning existing base diffusion models on human video data, our method demonstrates strong generalization to unseen human identities and poses without requiring additional per-instance fine-tuning. Experimental results validate the effectiveness of our approach, showing competitive performance compared to existing methods for consistent human image and video synthesis.
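The two-stage decomposition described in the abstract can be sketched for a single self-attention layer. This is a toy illustration under stated assumptions, not the paper's code: it ignores diffusion timesteps, pose injection, and every other layer of the denoising network, and the function names (`extract_reference`, `generate_target`, `joint_masked_attention`) and random weights are hypothetical. The point it shows is that, because reference tokens never read from target tokens, their keys and values can be computed once and reused for every target frame.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis; exp(-inf) evaluates to 0.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def joint_masked_attention(ref, tgt, wq, wk, wv):
    # One pass over [reference, target] with reference queries blocked from target keys.
    n_ref = len(ref)
    x = np.concatenate([ref, tgt])
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores[:n_ref, n_ref:] = -np.inf
    return softmax(scores) @ v

def extract_reference(ref, wk, wv):
    # Stage 1 (assumed form): compute reference keys and values once and cache them.
    return ref @ wk, ref @ wv

def generate_target(tgt, ref_kv, wq, wk, wv):
    # Stage 2 (assumed form): target queries attend to cached reference K/V plus their own.
    ref_k, ref_v = ref_kv
    k = np.concatenate([ref_k, tgt @ wk])
    v = np.concatenate([ref_v, tgt @ wv])
    scores = (tgt @ wq) @ k.T / np.sqrt(wq.shape[1])
    return softmax(scores) @ v

# Toy data: one reference image and three target "frames" (sizes are illustrative).
rng = np.random.default_rng(0)
d, n_ref, n_tgt = 8, 4, 6
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
ref = rng.standard_normal((n_ref, d))
frames = [rng.standard_normal((n_tgt, d)) for _ in range(3)]

ref_kv = extract_reference(ref, wk, wv)  # computed once, shared by all frames
two_stage = [generate_target(f, ref_kv, wq, wk, wv) for f in frames]
joint = [joint_masked_attention(ref, f, wq, wk, wv)[n_ref:] for f in frames]

# The target outputs of both formulations should agree up to floating-point error.
stages_match = all(np.allclose(a, b) for a, b in zip(two_stage, joint))
```

In this sketch the cached reference keys and values are shared across all frames, which is one plausible reading of why the decomposition helps efficiency for video: the reference pass is not repeated per frame.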

Paper Structure

This paper contains 17 sections, 6 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: We propose spatially-conditioned diffusion (SCD) for consistent human image and video synthesis. Left part: our method can generate content-consistent human videos given a reference human image and target poses. Right part: our method can also be applied to visual try-on to maintain the appearance details of the garment.
  • Figure 2: Overview of the spatially-conditioned diffusion model. Our framework achieves consistent human image and video synthesis by inpainting the desired image under the spatially conditioned reference human image using only the denoising network. A causal feature interaction and reference pose information injection are introduced to further ensure the content consistency between the generated and reference images.
  • Figure 3: Examples of content consistency through spatial conditioning. (a) Example of zero-shot inpainting with a pretrained Stable Diffusion model (Rombach et al., 2022). (b) Results of the spatially-conditioned diffusion model with different configurations. The simple spatial conditioning strategy can generate visually consistent humans, and the attention layers play a key role in achieving such consistency.
  • Figure 4: Separating the causal spatial conditioning process into the reference feature extraction and target image generation.
  • Figure 5: Qualitative comparison against state-of-the-art human animation methods on the TikTok dataset (Wang et al., 2023). The proposed method can generate high-quality human videos that comply with the given pose sequence.
  • ...and 6 more figures