Human Video Generation from a Single Image with 3D Pose and View Control

Tiantian Wang, Chun-Han Yao, Tao Hu, Mallikarjun Byrasandra Ramalinga Reddy, Ming-Hsuan Yang, Varun Jampani

TL;DR

HVG is a latent video diffusion model that generates high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. It outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

Abstract

Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.

Paper Structure

This paper contains 12 sections, 1 equation, 21 figures, 3 tables.

Figures (21)

  • Figure 1: HVG Overview. HVG is capable of generating consistent multi-view human videos from a single image, conditioned on the given multi-view pose sequences and camera poses.
  • Figure 2: Framework of HVG. The bone map sequence is processed by the pose modulator, followed by the denoising process of DenoisingNet, which integrates camera parameters via camera embedding and time steps to generate a multi-view human video. DenoisingNet consists of convolutional blocks, spatial attention, view attention, and temporal attention to capture temporal and spatial correspondences. The reference image contributes in three key ways: First, ReferenceNet extracts fine-grained details to enhance spatial attention. Second, semantic features are captured through the CLIP Encoder for convolutional blocks and view attention and are fused with multi-frame noise. Third, the VAE Encoder processes reference image features for temporal attention.
  • Figure 3: Human position alignment. The left and right figures display the human subject across different views, both without and with alignment. After alignment, the human subject is positioned consistently in the same location across views.
  • Figure 4: Illustration of spatio-temporal sampling. To generate a long multi-view video, we divide the sequence into overlapping segments along both the temporal and view dimensions, denoted as $\{\mathcal{ST}_t^i \mid i=1,2,\ldots\}$ and $\{\mathcal{SV}_t^j \mid j=1,2,\ldots\}$. At each timestep $t$, these segments are independently aggregated to form long-range latent representations $\mathbf{z}_{t}^{\text{ST}}$ and $\mathbf{z}_{t}^{\text{SV}}$ in the temporal and view dimensions. The denoised temporal latent $\mathbf{z}_{t}^{\text{ST}}$ and view latent $\mathbf{z}_{t}^{\text{SV}}$ are then combined through a learned weighting strategy to produce the updated latent feature $\mathbf{z}_{t-1}$. Repeating this denoising process until $t=1$ yields $\mathbf{z}_0$, which is then decoded to synthesize the final long multi-view human video.
  • Figure 5: Novel-view results from single-view images. HVG (ours) generates novel-view images with higher anatomic fidelity, faithfulness to input images, and consistency of geometry and texture.
  • ...and 16 more figures
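
The overlapping-segment sampling described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a latent laid out as (views, frames, channels), plain averaging where segments overlap, and a fixed scalar `weight` standing in for the learned weighting strategy. The names `overlapping_windows`, `aggregate_along_axis`, `sampling_step`, and the `denoise_fn` callback are hypothetical.

```python
import numpy as np

def overlapping_windows(length, window, stride):
    """Return (start, end) index pairs that cover range(length) with overlap."""
    if window >= length:
        return [(0, length)]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:  # add a final window so the tail is covered
        starts.append(length - window)
    return [(s, s + window) for s in starts]

def aggregate_along_axis(latent, axis, window, stride, denoise_fn):
    """Denoise overlapping segments along one axis and average the overlaps.

    Averaging is an assumption of this sketch; the caption only says the
    segments are aggregated.
    """
    out = np.zeros_like(latent, dtype=float)
    count = np.zeros(latent.shape[axis])
    for s, e in overlapping_windows(latent.shape[axis], window, stride):
        index = [slice(None)] * latent.ndim
        index[axis] = slice(s, e)
        out[tuple(index)] += denoise_fn(latent[tuple(index)])
        count[s:e] += 1
    shape = [1] * latent.ndim
    shape[axis] = -1  # broadcast the per-position counts along `axis`
    return out / count.reshape(shape)

def sampling_step(z_t, denoise_fn, weight=0.5, window=4, stride=2):
    """One step: blend the temporal-axis and view-axis aggregates.

    `weight` is a fixed scalar here, whereas the paper describes a learned
    weighting strategy.
    """
    z_st = aggregate_along_axis(z_t, 1, window, stride, denoise_fn)  # frames
    z_sv = aggregate_along_axis(z_t, 0, window, stride, denoise_fn)  # views
    return weight * z_st + (1.0 - weight) * z_sv

if __name__ == "__main__":
    z = np.random.default_rng(0).normal(size=(6, 10, 3))
    # A placeholder "denoiser" that only shrinks the latent slightly.
    z_next = sampling_step(z, lambda x: 0.9 * x)
    print(z_next.shape)
```

In a real sampler, `denoise_fn` would be one denoising update of the diffusion model applied to a segment, and `sampling_step` would be called once per timestep until the final latent is reached.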