Synthesizing Moving People with 3D Control

Boyi Li, Junming Chen, Jathushan Rajasegaran, Yossi Gandelsman, Alexei A. Efros, Jitendra Malik

TL;DR

This work introduces 3DHM, a two-stage diffusion framework that animates a new person from a single image to match a target 3D motion sequence. Stage-1 performs texture-map inpainting to recover a complete UV texture, while Stage-2 renders realistic clothing- and hair-rich appearances under 3D pose control, trained with self-supervised data. The combination of texture priors, 3D control, and appearance alignment enables faithful pose propagation across long sequences and varied viewpoints, outperforming prior methods on image- and video-level metrics and pose fidelity. The approach advances real-world moving-human synthesis with minimal input data and broad generalization potential for animating arbitrary individuals.

Abstract

In this paper, we present a diffusion-model-based framework for animating people from a single image given a target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model in texture-map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that is faithful to the target motion in 3D pose and to the input image in visual similarity. In addition, the 3D control allows the person to be rendered along various synthetic camera trajectories. Our experiments show that our method is more resilient than prior methods when generating prolonged motions and varied, challenging, and complex poses. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.

Paper Structure

This paper contains 19 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: The Imitation Game: Given a video of a person, "The Actor", we want to transfer their motion to a new person, "The Imitator". In this figure, the first row shows a sequence of frames of the actor, a ballerina performing the Dance of the Sugar Plum Fairy. The inset row shows the 3D poses extracted from this video. Given any single image of a new person, the imitator, our model can synthesize new renderings of the imitator that copy the actions of the actor in 3D.
  • Figure 2: Overview of 3DHM: Given an image of the imitator and a sequence of 3D poses from the actor, we first generate a complete texture map of the imitator, which is applied to the 3D pose sequence extracted from the actor to produce texture-mapped intermediate renderings of the imitator. We then pass these intermediate renderings to the Stage-2 model, which turns the SMPL mesh renderings into more realistic images.
  • Figure 3: Stage-1 of 3DHM: In the first stage, given a single-view image of an imitator, we first apply a 4DHumans-style sampling approach (Goel et al., 2023) to extract a partial texture map and its corresponding visibility map. These two inputs are passed to the in-painting diffusion model to generate a plausible complete texture map. In this example, although we only see the front view of the imitator, the model is able to hallucinate a plausible back region that is consistent with their clothing.
  • Figure 4: Stage-2 of 3DHM: Given an intermediate rendering of the imitator in the pose of the actor and the actual RGB image of the imitator, our model can synthesize realistic renderings of the imitator in the pose of the actor.
  • Figure 5: Scaled-up Stage-2 of the 3DHM model: To enable consistent background and human generation, we train ReferenceNet with ControlNet, then fine-tune only the temporal-attention layer of the UNet and keep the other components frozen.
  • ...and 3 more figures
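
The Figure 3 caption describes Stage-1 as taking a partial texture map and a visibility map as inputs. As a rough illustration of what those two inputs might look like, the sketch below scatters the visible pixels of an image into UV space. It is not the authors' implementation: the function name `sample_partial_texture`, the texture resolution, and the assumption that per-pixel UV coordinates are already available (for example, from rasterizing a fitted body mesh) are all hypothetical.

```python
import numpy as np


def sample_partial_texture(image, uv, fg_mask, tex_size=256):
    """Scatter visible body pixels into a UV texture map (illustrative only).

    image:   (H, W, 3) float array of RGB values.
    uv:      (H, W, 2) float array in [0, 1]; per-pixel (u, v) coordinates,
             assumed to come from a fitted body mesh (not computed here).
    fg_mask: (H, W) bool array, True where the pixel lies on the body.
    Returns (texture, visibility) with shapes (T, T, 3) and (T, T).
    """
    texture = np.zeros((tex_size, tex_size, 3), dtype=np.float64)
    counts = np.zeros((tex_size, tex_size), dtype=np.float64)

    # Map continuous UV coordinates to integer texel indices.
    texel = np.clip((uv[fg_mask] * tex_size).astype(int), 0, tex_size - 1)
    cols, rows = texel[:, 0], texel[:, 1]

    # Accumulate colors per texel; np.add.at handles repeated indices.
    np.add.at(texture, (rows, cols), image[fg_mask])
    np.add.at(counts, (rows, cols), 1.0)

    # Texels hit by at least one pixel are "visible"; average their colors.
    visibility = counts > 0
    texture[visibility] /= counts[visibility][:, None]
    return texture, visibility
```

Texels where `visibility` is False are the regions an in-painting model would need to fill in. Because this representation is indexed by body surface location rather than by image position, the same texel refers to the same body part regardless of pose or camera, which is the invariance the abstract cites as the reason for training in texture-map space.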