CFSynthesis: Controllable and Free-view 3D Human Video Synthesis

Liyuan Cui, Xiaogang Xu, Wenqi Dong, Zesong Yang, Hujun Bao, Zhaopeng Cui

TL;DR

CFSynthesis targets controllable, free-view 3D human video synthesis from a single image by coupling a texture-prior, SMPL-based pose representation with a foreground-background separation strategy inside a latent diffusion framework. A textured SMPL sequence keeps the character's appearance consistent across views, and a masking-based foreground/background decoupling lets users specify the scene. Only a subset of the conditioning components is trained (the pose extractor, the foreground and background encoders, and the cross-attention layers); the remaining components are frozen. The method achieves high-quality results on TikTok, AIST, and 4D in-the-wild data, and surpasses prior work on free-view and background-insertion tasks. It is relevant to VR, storytelling, and digital content creation because it enables realistic 3D human video synthesis with flexible background control from minimal input data.

Abstract

Human video synthesis aims to create lifelike characters in various environments, with wide applications in VR, storytelling, and content creation. While 2D diffusion-based methods have made significant progress, they struggle to generalize to complex 3D poses and varying scene backgrounds. To address these limitations, we introduce CFSynthesis, a novel framework for generating high-quality human videos with customizable attributes, including identity, motion, and scene configurations. Our method leverages a texture-SMPL-based representation to ensure consistent and stable character appearances across free viewpoints. Additionally, we introduce a novel foreground-background separation strategy that decomposes the scene into foreground and background, enabling seamless integration of user-defined backgrounds. Experimental results on multiple datasets show that CFSynthesis not only achieves state-of-the-art performance in complex human animations but also adapts effectively to 3D motions in free-view and user-specified scenarios.

Paper Structure

This paper contains 12 sections, 8 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: CFSynthesis. Given a single reference image, CFSynthesis can synthesize human videos driven by a texture-based SMPL representation derived from 3D pose estimation or generation. It also integrates user-desired scenes as controllable attributes, enabling the generation of lifelike 3D motion videos with varying backgrounds in free-view.
  • Figure 2: An overview of the proposed framework. CFSynthesis first warps an estimated texture map onto the given 3D motion sequence and projects it to 2D space through camera pose $T^{i}$ to get the SMPL representation $M^{i}$, which is then encoded as pose signals $\boldsymbol{z}_{pose}$. The foreground and background are separately encoded as $\boldsymbol{z}_{fg}$ and $\boldsymbol{z}_{bg}$, respectively, and are recomposed during the decoder stage using a masking mechanism. These components collaboratively guide the original latent code $\boldsymbol{z}_{0}$ for the target frame. In the U-Net architecture, the same trainable/frozen strategy applies to every layer, so only the first layer is illustrated here.
  • Figure 3: Implementation of the Masking Mechanism and Pose Extractor. We visualize the operation of the masking mechanism and observe that features in the foreground region tend to diffuse toward the edges and overflow after the first layer of spatial attention. To mitigate this issue, we refine the foreground features using the downsampled $f_l^{seg}$. In the pose extractor, self-attention effectively captures structured information in SMPL representation, including facial features, torso details, and clothing textures.
  • Figure 4: Qualitative comparisons between our approach and state-of-the-art methods on the TikTok dataset. We annotate the control conditions in the bottom right corner. The SMPL representation provides robust priors that yield the most reliable appearance quality.
  • Figure 5: Qualitative comparison with state-of-the-art methods on the AIST dataset. Our approach demonstrates the best quality in preserving both the fidelity and consistency of character appearance across 360-degree views.
  • ...and 5 more figures
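
The masking-based recomposition mentioned in the Figure 2 and Figure 3 captions can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the recomposition is a per-location blend of the form mask * foreground + (1 - mask) * background, and that the segmentation mask is brought to feature resolution by average pooling. The function names `downsample_mask` and `compose`, the pooling choice, and the toy 2D grids standing in for feature maps are all hypothetical.

```python
from typing import List

Grid = List[List[float]]  # toy stand-in for a single-channel feature map


def downsample_mask(mask: Grid, factor: int) -> Grid:
    """Average-pool a mask by an integer factor (assumed pooling choice)."""
    h, w = len(mask), len(mask[0])
    if h % factor or w % factor:
        raise ValueError("mask size must be divisible by factor")
    out: Grid = []
    for r in range(0, h, factor):
        row = []
        for c in range(0, w, factor):
            block = [mask[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def compose(fg: Grid, bg: Grid, mask: Grid) -> Grid:
    """Blend per location: mask * fg + (1 - mask) * bg (assumed blend form)."""
    return [[m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
            for f_row, b_row, m_row in zip(fg, bg, mask)]


# Full-resolution binary segmentation mask (1 = person, 0 = scene).
seg = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 0]]

# Foreground and background features at half resolution.
fg_feat = [[10.0, 10.0], [10.0, 10.0]]
bg_feat = [[2.0, 2.0], [2.0, 2.0]]

small_mask = downsample_mask(seg, 2)
blended = compose(fg_feat, bg_feat, small_mask)
```

In this toy setup the mask is fractional only where a pooled block straddles the person's boundary, so foreground features cannot leak into locations the mask marks as pure background.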