Human Video Generation from a Single Image with 3D Pose and View Control

Tiantian Wang; Chun-Han Yao; Tao Hu; Mallikarjun Byrasandra Ramalinga Reddy; Ming-Hsuan Yang; Varun Jampani

TL;DR

HVG is a latent video diffusion model that generates high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. It outperforms existing methods in generating 4D human videos from diverse human images and pose inputs.

Abstract

Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

Paper Structure (12 sections, 1 equation, 21 figures, 3 tables)

Introduction
Related Work
Methodology
Dual-Dimensional Bone Map
Network Architecture
Progressive Spatio-Temporal Sampling
Experiments
Novel View Synthesis
Novel View and Novel Pose Synthesis
Ablation Study
Failure Case
Conclusion

Figures (21)

Figure 1: HVG Overview. HVG is capable of generating consistent multi-view human videos from a single image, conditioned on the given multi-view pose sequences and camera poses.
Figure 2: Framework of HVG. The bone map sequence is processed by the pose modulator, followed by the denoising process of DenoisingNet, which integrates camera parameters via camera embedding and time steps to generate a multi-view human video. DenoisingNet consists of convolutional blocks, spatial attention, view attention, and temporal attention to capture temporal and spatial correspondences. The reference image contributes in three key ways: First, ReferenceNet extracts fine-grained details to enhance spatial attention. Second, semantic features are captured through the CLIP Encoder for convolutional blocks and view attention and are fused with multi-frame noise. Third, the VAE Encoder processes reference image features for temporal attention.
Figure 3: Human position alignment. The left and right figures display the human subject across different views, both without and with alignment. After alignment, the human subject is positioned consistently in the same location across views.
Figure 4: Illustration of spatio-temporal sampling. To generate a long multi-view video, we divide the sequence into overlapping segments along both the temporal and view dimensions, denoted as $\{\mathcal{ST}_t^i \mid i=1,2,\ldots\}$ and $\{\mathcal{SV}_t^j \mid j=1,2,\ldots\}$. At each timestep $t$, these segments are independently aggregated to form the long-range latent representations $\mathbf{z}_{t}^{\text{ST}}$ and $\mathbf{z}_{t}^{\text{SV}}$ in the temporal and view dimensions. The denoised temporal latent $\mathbf{z}_{t}^{\text{ST}}$ and view latent $\mathbf{z}_{t}^{\text{SV}}$ are then combined through a learned weighting strategy to produce the updated latent feature $\mathbf{z}_{t-1}$. Repeating this denoising process until $t=1$ yields $\mathbf{z}_0$, which is then decoded to synthesize the final long multi-view human video.
Figure 5: Novel-view results from single-view images. HVG (ours) generates novel-view images with higher anatomic fidelity, faithfulness to input images, and consistency of geometry and texture.
...and 16 more figures
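
The overlapping-segment sampling described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`segment_starts`, `aggregate`, `sampling_step`), the window and stride values, the averaging of overlapping segments, and the fixed scalar weight `w` (standing in for the learned weighting) are all assumptions made for illustration, and `denoise` is a placeholder for the actual denoising network.

```python
import numpy as np

def segment_starts(length, window, stride):
    """Start indices of overlapping windows that together cover [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # final window sits flush with the end
    return starts

def aggregate(z, axis, window, stride, denoise):
    """Denoise overlapping windows along `axis`, averaging where they overlap."""
    out = np.zeros_like(z, dtype=float)
    count = np.zeros(z.shape[axis])
    for s in segment_starts(z.shape[axis], window, stride):
        idx = [slice(None)] * z.ndim
        idx[axis] = slice(s, s + window)
        idx = tuple(idx)
        out[idx] += denoise(z[idx])
        count[s:s + window] += 1
    shape = [1] * z.ndim
    shape[axis] = -1
    return out / count.reshape(shape)

def sampling_step(z_t, denoise, t_window=4, v_window=2, w=0.5):
    """One step: blend the temporal-segment and view-segment estimates.

    z_t has shape (views, frames, ...). The fixed weight `w` is a stand-in
    (assumption) for the learned weighting mentioned in the caption.
    """
    z_st = aggregate(z_t, axis=1, window=t_window, stride=2, denoise=denoise)
    z_sv = aggregate(z_t, axis=0, window=v_window, stride=1, denoise=denoise)
    return w * z_st + (1.0 - w) * z_sv

# Toy usage: 4 views, 10 frames, 8 latent channels, with a dummy denoiser.
z = np.random.default_rng(0).normal(size=(4, 10, 8))
z_prev = sampling_step(z, denoise=lambda x: 0.5 * x)
```

Averaging over the overlaps means every frame and view receives a contribution from each segment that covers it, which is one simple way to keep neighbouring segments from drifting apart.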
