MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis

Yihao Zhi, Chenghong Li, Hongjie Liao, Xihe Yang, Zhengwentai Sun, Jiahao Chang, Xiaodong Cun, Wensen Feng, Xiaoguang Han

TL;DR

MV-Performer addresses 360-degree, synchronized 4D human novel view synthesis from monocular video. It introduces depth-based geometric conditioning and a camera-dependent normal map cue within a multi-view diffusion framework built on WAN 2.1 and MVHumanNet, paired with robust monocular depth refinement for in-the-wild data. The approach achieves state-of-the-art fidelity and cross-view consistency on MVHumanNet and DNA-Rendering, and demonstrates utility as a generative prior for downstream tasks. The work advances the practical synthesis of human-centric 4D content by integrating explicit geometric cues, synchronization mechanisms, and a robust inference pipeline. These contributions enable immersive applications in VR/AR, free-viewpoint video, and synthetic data generation, with strong potential for broader adoption in 4D human synthesis benchmarks.

Abstract

Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting the camera trajectory within the front view and struggle to generate 360-degree viewpoint changes. In this paper, we focus on the human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel view videos from monocular full-body captures. To achieve 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative condition signal. Specifically, we use camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization in the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial rendering, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild video cases, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate MV-Performer's state-of-the-art effectiveness and robustness, establishing a strong baseline for human-centric 4D novel view synthesis.
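To make the camera-dependent normal cue more concrete, here is a minimal sketch of one plausible reading of it: a normal estimated in the reference view is re-expressed in each target camera's frame, so the same surface point looks different depending on whether the target camera sees its front or its back. The function names, the orbit convention (reference camera as world frame, looking along +z) and the RGB encoding are assumptions for illustration only. The summary does not specify how the paper actually renders or encodes these maps.

```python
import math

def world_to_camera_rotation(yaw_deg):
    """World-to-camera rotation for a camera orbited yaw_deg about the vertical (y) axis.

    Assumed convention: the reference camera defines the world frame and looks along +z.
    """
    a = math.radians(yaw_deg)
    c, s = math.cos(a), math.sin(a)
    # Transpose of the camera-to-world rotation R_y(a).
    return [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]

def camera_dependent_normal(R_w2c, normal_world):
    """Express a world-space unit normal in the target camera's frame."""
    return tuple(sum(R_w2c[i][j] * normal_world[j] for j in range(3)) for i in range(3))

def faces_camera(normal_cam):
    """With the camera looking along +z, a surface that faces it has normal z < 0."""
    return normal_cam[2] < 0.0

def encode_rgb(normal_cam):
    """Common normal-map encoding (an assumption here): map [-1, 1] to [0, 255]."""
    return tuple(round(255 * (n + 1.0) / 2.0) for n in normal_cam)

# A point on the performer's front, observed by the reference camera.
front_normal = (0.0, 0.0, -1.0)
for yaw in (0, 60, 120, 180):
    n_cam = camera_dependent_normal(world_to_camera_rotation(yaw), front_normal)
    print(yaw, faces_camera(n_cam), encode_rgb(n_cam))
```

In this toy setup the normal flips from facing the camera to facing away once the target view moves behind the subject. That flip is the kind of signal that could let a model tell observed surfaces from ones it has to hallucinate.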

Paper Structure

This paper contains 22 sections, 7 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: (i) The depth warping condition at the rear viewpoints presents ambiguity for the model. (ii) Inaccurate monocular depth produces floater-like renderings when there is a significant change in viewpoint.
  • Figure 2: The overview of our MV-Performer. "SA" and "CA" are abbreviations for self-attention and cross-attention, respectively. We first estimate the depth and normal from Sapiens (Khirodkar et al., 2024) and then use these estimates to refine the noisy point cloud output from MegaSaM (Li et al., 2024). Next, we render the refined point cloud with corresponding colors to novel views as geometric conditions. Finally, we feed them into MV-Performer to synthesize a 4D human video from novel viewpoints.
  • Figure 3: Our proposed camera-dependent normal condition assists the model in distinguishing between observed and unobserved condition information, resulting in a more accurate 360-degree synthesis.
  • Figure 4: The synchronization attention substantially enhances generation consistency across views.
  • Figure 5: The initial estimated point clouds contain floaters near the edges of the character, which give the video diffusion model poor guidance. In contrast, our method achieves clean estimations and yields pleasing results.
  • ...and 3 more figures
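
The depth warping in Figure 1 and the point-cloud rendering in Figure 2 both rest on the same pinhole geometry: unproject a reference pixel with its depth, move it into the target camera's frame, and project it again. The sketch below illustrates that step only. The intrinsics tuple, the orbit helper and all function names are hypothetical, and it is not the paper's renderer, which would also need splatting and occlusion handling.

```python
import math

def orbit_extrinsics(yaw_deg, pivot_depth):
    """World-to-camera (R, t) for a camera orbited yaw_deg about the vertical axis
    through the point (0, 0, pivot_depth). Assumes the reference camera is the world frame."""
    a = math.radians(yaw_deg)
    c, s = math.cos(a), math.sin(a)
    R = [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]  # transpose of R_y(a)
    d = pivot_depth
    # Camera centre: the reference origin rotated by R_y(a) about the pivot.
    center = (-s * d, 0.0, d - c * d)
    # Standard extrinsics relation: t = -R * center.
    t = tuple(-sum(R[i][j] * center[j] for j in range(3)) for i in range(3))
    return R, t

def warp_depth_pixel(u, v, depth, K, R, t):
    """Warp one reference pixel with known depth into the target view.

    K is (fx, fy, cx, cy). Returns (u, v, depth) in the target image,
    or None if the point lies behind the target camera.
    """
    fx, fy, cx, cy = K
    # Unproject to a 3D point in the reference camera frame.
    p_ref = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)
    # Rigid transform into the target camera frame.
    p_tgt = tuple(sum(R[i][j] * p_ref[j] for j in range(3)) + t[i] for i in range(3))
    if p_tgt[2] <= 0.0:
        return None
    # Perspective projection with the same intrinsics.
    return (fx * p_tgt[0] / p_tgt[2] + cx, fy * p_tgt[1] / p_tgt[2] + cy, p_tgt[2])

K = (100.0, 100.0, 50.0, 50.0)
for yaw in (0, 180):
    R, t = orbit_extrinsics(yaw, pivot_depth=2.0)
    print(yaw, warp_depth_pixel(55.0, 50.0, 2.0, K, R, t))
```

A pixel slightly right of centre in the reference view lands slightly left of centre in the 180-degree view, even though the rear camera would really see the performer's back there. That matches the ambiguity Figure 1 describes. An error in the input depth also shifts the warped pixel more as the viewpoint change grows, which is how the floater-like artifacts arise.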