PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling

Xiaoyun Zheng, Liwei Liao, Xufeng Li, Jianbo Jiao, Rongjie Wang, Feng Gao, Shiqi Wang, Ronggang Wang

TL;DR

PKU-DyMVHumans addresses the need for high-fidelity dynamic human data to advance reconstruction and photo-realistic rendering. It introduces a dense multi-view dataset with 32 subjects, 45 dynamic scenarios, and 8.2 million frames captured by 56–60 cameras, plus a unified benchmark framework that evaluates NeRF-based methods with metrics such as PSNR, SSIM, and LPIPS. The paper benchmarks novel view synthesis, dynamic human modeling, and neural scene decomposition, revealing strengths of hash-encoded NeRFs and challenges from loose clothing, complex motions, and multi-person interactions. The dataset and benchmark provide a practical resource for developing robust dynamic human representations and guide future improvements in multi-view capture and neural rendering.
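
Of the metrics named above, PSNR is the simplest to state: it is the log-scaled ratio between the squared peak pixel value and the mean squared error between a rendered view and its ground-truth image. The sketch below is a minimal illustration, assuming flattened pixel sequences with values in [0, max_value]; the function name `psnr` and its parameters are hypothetical and not taken from the benchmark code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumption: both inputs are flat sequences of intensities in [0, max_value].
    """
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means a rendering closer to the reference. SSIM and LPIPS capture structural and perceptual similarity respectively and are typically computed with dedicated libraries rather than by hand.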

Abstract

High-quality human reconstruction and photo-realistic rendering of a dynamic scene is a long-standing problem in computer vision and graphics. Despite considerable efforts invested in developing various capture systems and reconstruction algorithms, recent advancements still struggle with loose or oversized clothing and overly complex poses. In part, this is due to the challenges of acquiring high-quality human datasets. To facilitate the development of these fields, in this paper, we present PKU-DyMVHumans, a versatile human-centric dataset for high-fidelity reconstruction and rendering of dynamic human scenarios from dense multi-view videos. It comprises 8.2 million frames captured by more than 56 synchronized cameras across diverse scenarios. These sequences comprise 32 human subjects across 45 different scenarios, each with highly detailed appearance and realistic human motion. Inspired by recent advancements in neural radiance field (NeRF)-based scene representations, we carefully set up an off-the-shelf framework that makes it easy to plug in state-of-the-art NeRF-based implementations and benchmark them on the PKU-DyMVHumans dataset. This paves the way for various applications such as fine-grained foreground/background decomposition, high-quality human reconstruction, and photo-realistic novel view synthesis of a dynamic scene. Extensive studies are performed on the benchmark, demonstrating new observations and challenges that emerge from using such high-fidelity dynamic data.

Paper Structure (18 sections, 19 figures, 5 tables)

Figures (19)

  • Figure 1: We present PKU-DyMVHumans, a versatile human-centric dataset designed for high-fidelity reconstruction and rendering of dynamic human performances from dense multi-view videos. It comprises 32 humans across 45 different dynamic scenarios, each featuring highly detailed appearances and complex human motions.
  • Figure 2: Research with PKU-DyMVHumans. It supports various research topics, including neural scene decomposition, novel view synthesis, and dynamic human modeling.
  • Figure 3: Category definition and distribution of the proposed PKU-DyMVHumans dataset.
  • Figure 4: Benchmark pipeline of PKU-DyMVHumans. Given a multi-view video input, the first step is to extract the frames and estimate the foreground object mask and camera parameters. Specifically, BGMv2 is used to generate the binary foreground object mask. Afterwards, COLMAP SfM is used to estimate camera parameters and generate a sparse point cloud. Using these components, we have constructed three benchmarks. (a) The Instant-NGP implementation of NeRF requires initial camera parameters in a JSON file compatible with the original NeRF codebase. (b) In addition to RGB and mask images, the NeuS implementation expects a camera file that contains a projection matrix and a normalization matrix for each image. (c) We also provide data conversion from the NeuS format to the NeuS2 and Tensor4D formats for specifying dynamic scenes.
  • Figure 5: Comparisons on PKU-DyMVHumans dataset for static scene geometry reconstruction and novel view synthesis.
  • ...and 14 more figures
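
The Figure 4 caption notes that the NeuS implementation expects, for each image, a projection matrix and a normalization matrix. The sketch below illustrates one way such matrices could be assembled, assuming pinhole intrinsics K and world-to-camera extrinsics [R | t] such as those estimated in the COLMAP step. All function names, the `margin` parameter, and the centroid-based bounding sphere are illustrative assumptions, not the benchmark's actual conversion code.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def projection_matrix(K, R, t):
    """4x4 world-to-pixel matrix: P = K [R | t], padded with the row [0, 0, 0, 1].

    Assumes K is 3x3 intrinsics, R is a 3x3 world-to-camera rotation,
    and t is the 3-vector world-to-camera translation.
    """
    Rt = [list(R[i]) + [t[i]] for i in range(3)]  # 3x4 extrinsics
    return matmul(K, Rt) + [[0.0, 0.0, 0.0, 1.0]]


def normalization_matrix(points, margin=1.1):
    """4x4 matrix mapping the unit sphere onto a sphere enclosing `points`.

    Assumption: the sphere is centred at the centroid of the points (e.g. a
    sparse point cloud of the subject) and enlarged by a hypothetical `margin`.
    """
    if not points:
        raise ValueError("need at least one 3D point")
    n = len(points)
    c = [sum(p[i] for p in points) / n for i in range(3)]
    r = max(math.dist(p, c) for p in points) * margin
    return [[r, 0.0, 0.0, c[0]],
            [0.0, r, 0.0, c[1]],
            [0.0, 0.0, r, c[2]],
            [0.0, 0.0, 0.0, 1.0]]


def project(P, X):
    """Project a 3D world point X to pixel coordinates (u, v) using P."""
    h = matmul(P, [[X[0]], [X[1]], [X[2]], [1.0]])
    return (h[0][0] / h[2][0], h[1][0] / h[2][0])
```

Multiplying the projection matrix by the normalization matrix gives a mapping from the normalized (unit-sphere) space directly to pixels, which is why the two are stored side by side per image in this kind of camera file.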