FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions

Jiong Wang, Fengyu Yang, Wenbo Gou, Bingliang Li, Danqi Yan, Ailing Zeng, Yijun Gao, Junle Wang, Yanqing Jing, Ruimao Zhang

TL;DR

FreeMan addresses the gap between laboratory 3D HPE datasets and real-world conditions by introducing a large-scale, multi-view dataset captured with 8 smartphones and supported by a semi-automatic annotation pipeline. It provides benchmarks across monocular, lifting, multi-view, and neural rendering tasks, and demonstrates that models trained on FreeMan generalize better to real-world scenarios than those trained on traditional datasets. The work also details a practical toolchain for calibration, synchronization, and error correction, and shows that FreeMan yields meaningful transfer benefits for state-of-the-art methods while highlighting remaining challenges in real-world rendering and occlusion handling. Overall, FreeMan offers a valuable resource to drive robust 3D pose estimation and rendering in unconstrained environments with real-world impact for AR/VR, HRI, and animation.

Abstract

Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception. 3D human pose estimation is a vital step in advancing fields like AIGC and human-robot interaction, serving as a crucial technique for understanding and interacting with human actions in real-world settings. However, current datasets, often collected under single laboratory conditions using complex motion capture equipment and unvarying backgrounds, are insufficient. The absence of datasets on variable conditions is stalling the progress of this crucial task. To facilitate the development of 3D pose estimation, we present FreeMan, the first large-scale, multi-view dataset collected under real-world conditions. FreeMan was captured by synchronizing 8 smartphones across diverse scenarios. It comprises 11M frames from 8000 sequences, viewed from different perspectives. These sequences cover 40 subjects across 10 different scenarios, each with varying lighting conditions. We have also established a semi-automated pipeline containing error detection to reduce the workload of manual checking and ensure precise annotation. We provide comprehensive evaluation baselines for a range of tasks, underlining the significant challenges posed by FreeMan. Further evaluations on standard indoor/outdoor human sensing datasets reveal that FreeMan offers robust representation transferability in real and complex scenes. Code and data are available at https://wangjiongw.github.io/freeman.

Paper Structure

This paper contains 36 sections, 16 figures, and 10 tables.

Figures (16)

  • Figure 1: The left side displays sample frames from Human3.6M h36m_pami and HuMMan cai2022humman, which were collected under laboratory conditions, contrasted with our FreeMan dataset, which was collected in real-world scenarios. Frames from FreeMan have been cropped into a square format for visualization purposes, with the original resolution being $1920 \times 1080$ pixels. The right-hand side demonstrates the test results on 3DPW of the HMR model hmrKanazawa17 trained on these three datasets. Notably, the model trained using FreeMan is able to adapt flawlessly to real-world conditions, demonstrating its superior generalization ability. Visualization uses the mmHuman3D mmhuman3d implementation.
  • Figure 2: Equipment setup for data collection using 8 cameras. Cameras are attached to tripods.
  • Figure 3: (a) Distribution of distance from the camera to the center of the system, indicated by translation along the z-axis in camera parameters. Four vertical red lines represent the distances of the 4 cameras in Human3.6M h36m_pami. (b) Distribution of human bounding box areas. The horizontal axis represents the ratio of the bounding box area to the image area. The vertical axis is in log scale. (c) Correspondence of scenes and actions. Areas of blocks represent the scale of the respective frame number. The outermost circle shows actions and the middle circle presents the $10$ types of scenes in our dataset. Zoom in $10 \times$ for the best view.
  • Figure 4: The diverse frames in FreeMan. The topmost two rows present a range of indoor and outdoor scenes, highlighting human-object interactions and the diversity of scene contexts, lighting conditions, and subjects. The third row exhibits frames from different views. The final row illustrates the temporal variation of human poses from a consistent viewpoint, emphasizing the dynamism of motion capture.
  • Figure 5: Illustration of the data collection and annotation toolchain: (a) depicts the transmission of signals between cameras and servers for camera calibration, where chessboard frames are sent to the server, and camera parameters are returned. (b) demonstrates the synchronization process among devices. (c) showcases the pipeline for pose annotation.
  • ...and 11 more figures
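
The horizontal axis of Figure 3(b) is the ratio of the human bounding box area to the image area. A minimal sketch of that quantity is given here, assuming boxes are stored as pixel corners `(x_min, y_min, x_max, y_max)` and using the $1920 \times 1080$ frame size stated in the Figure 1 caption. The function name `bbox_area_ratio`, the box format, and the clipping step are illustrative assumptions, not taken from the FreeMan code.

```python
def bbox_area_ratio(bbox, image_size=(1920, 1080)):
    """Fraction of the image covered by a box given as (x_min, y_min, x_max, y_max).

    The box layout and the clipping to image bounds are assumptions made
    for this sketch; the dataset's own annotation format may differ.
    """
    x_min, y_min, x_max, y_max = bbox
    width, height = image_size

    # Clip the box to the image so out-of-frame parts do not inflate the ratio.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)

    # A box that lies entirely outside the image has zero visible area.
    box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
    return box_area / (width * height)


# Hypothetical example: a 320 x 720 pixel person box in a 1920 x 1080 frame.
ratio = bbox_area_ratio((800, 200, 1120, 920))
print(f"bounding box covers {ratio:.3f} of the image")
```

A histogram of this ratio over all annotated frames, drawn with a log-scaled vertical axis, would correspond to the kind of plot the caption describes.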