NToP: NeRF-Powered Large-scale Dataset Generation for 2D and 3D Human Pose Estimation in Top-View Fisheye Images

Jingrui Yu; Dipankar Nandi; Roman Seidel; Gangolf Hirtz

TL;DR

This work tackles the lack of large-scale top-view fisheye datasets for human pose estimation by introducing NToP, a NeRF-powered pipeline that converts existing 2D/3D datasets into semi-synthetic, top-view data with groundtruth 2D and 3D keypoints. The authors render over 570K images (NToP570K) using virtual fisheye cameras and provide OmniLab, a real-world top-view dataset, to validate cross-domain performance. Finetuning ViTPose-B on NToP-train boosts 2D AP by 33.3% on the NToP validation set, while HybrIK-Transformer finetuned on NToP-train achieves a substantial PA-MPJPE reduction of 53.7 mm for 3D HPE, demonstrating strong cross-domain gains and the utility of semi-synthetic data. The results indicate that NToP improves top-view HPE performance and reduces domain gaps relative to existing datasets, with potential extensions to multi-view, temporal, and more efficient NeRF-based rendering.

Abstract

Human pose estimation (HPE) in the top-view using fisheye cameras presents a promising and innovative application domain. However, the availability of datasets capturing this viewpoint is extremely limited, especially those with high-quality 2D and 3D keypoint annotations. Addressing this gap, we leverage the Neural Radiance Fields (NeRF) technique to establish a comprehensive pipeline for generating human pose datasets from existing 2D and 3D datasets, specifically tailored for the top-view fisheye perspective. Through this pipeline, we create a novel dataset, NToP570K (NeRF-powered Top-view human Pose dataset for fisheye cameras with over 570 thousand images), and conduct an extensive evaluation of its efficacy in enhancing neural networks for 2D and 3D top-view human pose estimation. A pretrained ViTPose-B model achieves an improvement in AP of 33.3% on our validation set for 2D HPE after finetuning on our training set. A similarly finetuned HybrIK-Transformer model achieves a 53.7 mm reduction in PA-MPJPE for 3D HPE on the validation set.

Paper Structure (22 sections, 9 equations, 10 figures, 7 tables)

Introduction
Related Work
2D and 3D Human Pose Estimation with Deep Learning
Top-View Human Pose Estimation Algorithms and Datasets
NeRF and Human-Centric NeRF Variants
NToP Data Generation Pipeline
NeRF Model Training
Fisheye Camera Model and Omnidirectional Rendering
Groundtruth Keypoint Annotation
NToP Dataset and OmniLab Dataset
Origin Datasets and Rendering Parameters of NToP
OmniLab Dataset
Dataset statistics and comparison
Dataset Validation
2D Pose Estimation with ViTPose
...and 7 more sections

Figures (10)

Figure 1: The NToP pipeline. We input images and their corresponding segmentation masks, groundtruth poses and camera parameters to HumanNeRF without the pose correction. After training, virtual fisheye cameras are positioned on top of the human model to render top-view images of a novel 3D pose. 2D groundtruth keypoint annotations are generated in post-processing.
Figure 2: (a) The equidistant projection model. $C$ is the camera center, $(\boldsymbol{X}_\mathbf{C},\boldsymbol{Y}_\mathbf{C},\boldsymbol{Z}_\mathbf{C})$ is the CCS. (b) Distribution of the ray cross points $\mathbf{q}$ on the image plane for a demonstrative $50\times 50$ pixel omnidirectional render. Extreme far points are not plotted.
Figure 3: Examples from OmniLab dataset. Actions: (a) brooming, (b) getting up from ground, (c) pulling object, and (d) sitting down and standing up.
Figure 4: Dataset examples: (a) PanopTOP31K, (b) THEODORE+, (c) ntopH36M, (d,e) ntopGB, (f) ntopZJU. The subjects are resized to roughly the same size to showcase the difference in render quality.
Figure 5: (a) Stripe artifacts due to non-linear ray distribution in omnidirectional rendering: actor p393 from zjumocap subset, frame 75, $h=1.0$, $R=0.5$, camera position NW. (b-d) Artifacts caused by incorrect modelling in genebody subset: (b) The keyboard is broken in the middle. (c) Parts of the guitar are missing. (d) The basketball is shown in both hands.
...and 5 more figures
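
The equidistant projection model named in the Figure 2 caption is commonly written as $r = f\,\theta$, where $\theta$ is the angle between a ray and the optical axis and $r$ is the radial distance from the principal point on the image. The following is a minimal sketch of that textbook model, not the authors' implementation: the function names `project_equidistant` and `pixel_to_ray`, the parameters `f`, `cx`, `cy`, and the convention that the optical axis is $+\boldsymbol{Z}_\mathbf{C}$ are all assumptions made for illustration. The forward function shows how a 3D keypoint in the camera coordinate system could be mapped to a 2D pixel, and the inverse shows how a pixel could be mapped to a ray direction for rendering.

```python
import math


def project_equidistant(point_c, f, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates.

    Assumes the ideal equidistant model r = f * theta with the optical
    axis along +Z_C and principal point (cx, cy); f is in pixels.
    These conventions are illustrative assumptions.
    """
    x, y, z = point_c
    rho = math.hypot(x, y)        # distance of the point from the optical axis
    if rho == 0.0:
        return (cx, cy)           # a point on the axis maps to the principal point
    theta = math.atan2(rho, z)    # angle between the ray and the optical axis
    r = f * theta                 # equidistant mapping: radius grows linearly with angle
    return (cx + r * x / rho, cy + r * y / rho)


def pixel_to_ray(u, v, f, cx, cy):
    """Map a pixel back to a unit ray direction in camera coordinates.

    Inverse of project_equidistant under the same assumptions.
    """
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)
    if r == 0.0:
        return (0.0, 0.0, 1.0)    # the principal point looks along the optical axis
    theta = r / f                 # invert r = f * theta
    s = math.sin(theta)
    return (s * dx / r, s * dy / r, math.cos(theta))
```

Because the image radius is linear in $\theta$, rays for evenly spaced pixels are evenly spaced in angle rather than in position on a plane in front of the camera, which is one way to read the non-uniform distribution of ray cross points shown in Figure 2(b).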
