CFSynthesis: Controllable and Free-view 3D Human Video Synthesis
Liyuan Cui, Xiaogang Xu, Wenqi Dong, Zesong Yang, Hujun Bao, Zhaopeng Cui
TL;DR
CFSynthesis addresses the challenge of controllable, free-view 3D human video synthesis from a single image by coupling a texture-prior SMPL-based pose representation with a foreground-background separation strategy within a latent diffusion framework. It introduces a textured SMPL sequence to maintain appearance consistency across views and a masking-based foreground/background decoupling to enable user-specified scenes. The method optimizes a subset of diffusion-conditioned components (pose extractor, foreground/background encoders, and cross-attention) while freezing others, achieving high-quality results on TikTok, AIST, and 4D in-the-wild data, and surpassing prior work in free-view and background-insertion tasks. This approach offers practical impact for VR, storytelling, and digital content creation by enabling realistic 3D-human video synthesis with flexible background control from minimal input data.
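As a rough illustration of the selective optimization mentioned above, the sketch below marks parameters as trainable or frozen by name. The component types (pose extractor, foreground/background encoders, cross-attention) come from the summary. The module names, the `Param` stand-in class, and the matching rule are assumptions made for illustration and are not taken from the authors' code.

```python
from dataclasses import dataclass

# Hypothetical name patterns: the summary names the component types,
# but not the identifiers used in any actual implementation.
TRAINABLE_PREFIXES = ("pose_extractor.", "foreground_encoder.", "background_encoder.")
TRAINABLE_SUBSTRINGS = ("cross_attn",)


@dataclass
class Param:
    """Stand-in for a deep-learning framework parameter with a trainable flag."""
    name: str
    requires_grad: bool = True


def is_trainable(name: str) -> bool:
    """Return True if the parameter belongs to a component that is optimized."""
    return name.startswith(TRAINABLE_PREFIXES) or any(
        s in name for s in TRAINABLE_SUBSTRINGS
    )


def apply_freeze_policy(params):
    """Freeze everything except the selected components; return trainable names."""
    for p in params:
        p.requires_grad = is_trainable(p.name)
    return [p.name for p in params if p.requires_grad]


params = [
    Param("vae.decoder.conv.weight"),
    Param("pose_extractor.conv1.weight"),
    Param("unet.down.0.cross_attn.to_q.weight"),
    Param("unet.down.0.self_attn.to_q.weight"),
    Param("background_encoder.proj.weight"),
]
trainable = apply_freeze_policy(params)
```

In a real framework the same policy would typically be applied by iterating over a model's named parameters and toggling their gradient flag.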
Abstract
Human video synthesis aims to create lifelike characters in various environments, with wide applications in VR, storytelling, and content creation. While 2D diffusion-based methods have made significant progress, they struggle to generalize to complex 3D poses and varying scene backgrounds. To address these limitations, we introduce CFSynthesis, a novel framework for generating high-quality human videos with customizable attributes, including identity, motion, and scene configurations. Our method leverages a texture-SMPL-based representation to ensure consistent and stable character appearances across free viewpoints. Additionally, we introduce a novel foreground-background separation strategy that decomposes the scene into foreground and background, enabling seamless integration of user-defined backgrounds. Experimental results on multiple datasets show that CFSynthesis not only achieves state-of-the-art performance in complex human animations but also adapts effectively to 3D motions in free-view and user-specified scenarios.
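The idea behind foreground-background separation can be illustrated with a pixel-level toy example: a binary human mask splits a frame into two parts, and the foreground is then composited over a user-supplied background. This is only a sketch under simplifying assumptions (grayscale frames stored as nested lists, a given binary mask). The method itself operates inside a latent diffusion framework with learned encoders, which this example does not reproduce; the function names are hypothetical.

```python
def split_by_mask(frame, mask):
    """Split a frame into foreground and background using a binary mask.

    Masked-out pixels are set to 0 in each output.
    """
    foreground = [
        [px if m else 0 for px, m in zip(frow, mrow)]
        for frow, mrow in zip(frame, mask)
    ]
    background = [
        [0 if m else px for px, m in zip(frow, mrow)]
        for frow, mrow in zip(frame, mask)
    ]
    return foreground, background


def composite(foreground, mask, new_background):
    """Place the masked foreground over a user-specified background."""
    return [
        [f if m else b for f, m, b in zip(frow, mrow, brow)]
        for frow, mrow, brow in zip(foreground, mask, new_background)
    ]


# Toy 2x2 grayscale frame; mask value 1 marks human (foreground) pixels.
frame = [[10, 20], [30, 40]]
mask = [[1, 0], [0, 1]]
new_background = [[5, 5], [5, 5]]

fg, bg = split_by_mask(frame, mask)
result = composite(fg, mask, new_background)
```

Here `fg` keeps only the masked pixels, `bg` keeps the rest, and `result` shows the original foreground pixels placed on the replacement background.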
