AMG: Avatar Motion Guided Video Generation

Zhangsihao Yang, Mengyi Shan, Mohammad Farazi, Wenhui Zhu, Yanxi Chen, Xuanzhao Dong, Yalin Wang

TL;DR

AMG addresses the challenge of realistic, controllable human video generation by uniting 2D pre-trained diffusion models with 3D avatar-based motion control. It introduces a data-processing pipeline that extracts 3D motion and camera information from 2D videos to render avatar sequences, which are used to condition a pre-trained text-to-video diffusion model via LoRA fine-tuning. The approach enables multi-person video generation with precise control over camera position, human motion, and background style, outperforming pose- or driving-video conditioned baselines in realism and adaptability. This work advances practical controllable video synthesis with potential applications in VR/AR, film, and interactive media by combining rich 3D information with strong 2D priors.

Abstract

The task of human video generation has gained significant attention with the advancement of deep generative models. Generating realistic videos of human movement is inherently challenging because of the intricacies of human body topology and viewers' sensitivity to visual artifacts. The extensively studied 2D media generation methods take advantage of massive human media datasets but struggle with 3D-aware control, whereas 3D avatar-based approaches, while offering more freedom in control, lack photorealism and cannot be harmonized seamlessly with the background scene. We propose AMG, a method that combines 2D photorealism with 3D controllability by conditioning video diffusion models on controlled renderings of 3D avatars. We additionally introduce a novel data processing pipeline that reconstructs and renders human avatar movements from dynamic-camera videos. AMG is the first method that enables multi-person diffusion video generation with precise control over camera positions, human motions, and background style. We also demonstrate through extensive evaluation that it outperforms existing human video generation methods conditioned on pose sequences or driving videos in terms of realism and adaptability.

Paper Structure (17 sections, 8 equations, 13 figures, 1 table)

Figures (13)

  • Figure 1: Our proposed method generates realistic human videos given a single text prompt. We enable diverse controls by explicitly incorporating the rendering of a 3D human avatar as a conditional signal while fine-tuning a pre-trained video model. Specifically, we achieve generation with (a) novel motion, by generating control motion sequences from various text prompts, (b) free camera viewpoints, by simulating camera movements while rendering, and (c) novel scenes, by describing the background in the prompt.
  • Figure 2: Our method consists of two stages: training data generation and video-conditional finetuning. In the left column, we visualize key steps in our data generation pipeline. We begin by (a) detecting and reconstructing SMPL using TRACE; then (b) using LLaVA to generate textual descriptions that capture both the subjects' appearance and their interaction with the environment; and finally (c) rendering an avatar video using HumanGaussian, based on motion and camera from (a) and appearance from (b). In the right column, we illustrate how the synthetic human avatar video is used to condition the fine-tuning process by leveraging the input video condition and LoRA.
  • Figure 3: Video with explicit camera movement control. The left column shows a zoom-in sequence, and the right column shows a zoom-out sequence. Each column pairs the input rendered avatar video on the left with the generated video from our method on the right.
  • Figure 4: Video with user-defined human motion control. Given an action prompt, we start by animating the pre-generated human avatar with motions generated from the text. We render the motions for a specific camera angle (left columns in each pair) and feed the rendering as a condition to our video model to generate photorealistic human videos. Our model is able to generate videos of various novel activities that lie outside the distribution of the original training data.
  • Figure 5: Video with background changes. The upper-left corner of the first column shows the shared character avatar rendering; the remaining panels show the resulting videos generated with different prompts describing the scene.
  • ...and 8 more figures
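
The Figure 2 caption mentions that fine-tuning relies on LoRA. As a rough illustration of the general low-rank adaptation idea only (not the authors' implementation), the sketch below keeps a pre-trained weight matrix W frozen and adds a trainable update (alpha / r) * B A, with B initialized to zero so the adapted layer starts out identical to the pre-trained one. The class name `LoRALinear`, the helper `matmul`, and all parameter names and default values are hypothetical and chosen for this example; a real video diffusion model would apply such updates inside its attention layers using a tensor library.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical, dependency-free sketch of generic LoRA; not AMG's code.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, shape (d_out, d_in)
        # A gets a small random init; B starts at zero, so B @ A is zero at step 0.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A) without modifying the frozen W."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to an input vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the parameters in A and B, the only ones updated in training."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    layer = LoRALinear(identity, rank=2)
    print(layer.forward([1.0, 2.0, 3.0, 4.0]))  # same as the frozen layer at init
    print(layer.num_trainable(), "trainable vs", 4 * 4, "frozen parameters")
```

For a layer of shape d_out x d_in, the update trains r * (d_in + d_out) parameters instead of d_in * d_out, which is a saving only when the rank r is much smaller than the layer dimensions; the tiny 4 x 4 example above is for readability and shows no saving.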