HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes

Yichen Yao, Zimo Jiang, Yujing Sun, Zhencai Zhu, Xinge Zhu, Runnan Chen, Yuexin Ma

TL;DR

This work proposes an unsupervised 3D detection method for human-centric scenarios that transfers knowledge from synthetic human instances to real scenes, introducing novel modules for effective instance-to-scene representation transfer and synthetic-to-real feature alignment.

Abstract

Human-centric 3D scene understanding has recently drawn increasing attention, driven by its critical impact on robotics. However, human-centric real-life scenarios are extremely diverse and complicated, and humans have intricate motions and interactions. With limited labeled data, supervised methods struggle to generalize to unseen scenarios, hindering real-life applications. Mimicking human intelligence, we propose an unsupervised 3D detection method for human-centric scenarios that transfers knowledge from synthetic human instances to real scenes. To bridge the gap between the distinct data representations and feature distributions of synthetic models and real point clouds, we introduce novel modules for effective instance-to-scene representation transfer and synthetic-to-real feature alignment. Our method outperforms current state-of-the-art techniques, achieving an 87.8% improvement in mAP and closely approaching the performance of fully supervised methods (62.15 mAP vs. 69.02 mAP) on the HuCenLife dataset.

Paper Structure (19 sections, 4 equations, 6 figures, 5 tables)

This paper contains 19 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Humans have the ability to identify objects in 3D scenes, relying merely on their understanding of the objects' shapes and sizes. We aspire for machines to be able to perform 3D perception based solely on synthetic models, independent of any scene-level annotations.
  • Figure 2: Pipeline of our method. "PC", "RI", and "RC" stand for Point Cloud, Range Image, and Receptive Control, respectively. Individuals painted yellow are real humans, while those painted pink are synthetic humans. In stage 1, we introduce range-image-bridged insertion, a module that inserts parametric models into an existing dataset to create natural synthetic data. We train our detector on this data to produce initial pseudo-labels. In stage 2, we employ an unsupervised bi-directional filter to improve the quality of the pseudo-labels. Then, synthetic-to-real feature alignment is applied to enhance the detector's ability to generalize to real humans. In stage 3, we utilize human structural knowledge to boost the performance of the model. Finally, based on the resulting high-quality pseudo-labels, fine-tuning is used to make the model fully converge on identifying real humans.
  • Figure 3: Detection visualization. The first and second rows show results on HuCenLife [Xu2023HumancentricSU]. The third and fourth rows show results on STCrowd [Cong2022STCrowdAM].
  • Figure 4: Visualization of real data and synthetic data.
  • Figure 5: Visualization of a few synthetic human actions.
  • ...and 1 more figures
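
The Figure 2 caption mentions inserting synthetic humans into real scenes through a range image. The summary above does not spell out how that module works, so the following is only a minimal sketch of the generic idea, not the authors' implementation. It assumes a standard spherical projection and a simple "nearest surface wins" merge to mimic occlusion. The function names, the 32 x 1024 resolution, and the ±15° vertical field of view are all hypothetical choices made for illustration.

```python
import numpy as np


def point_cloud_to_range_image(points, height=32, width=1024,
                               fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project an (N, 3) point cloud onto an (H, W) range image.

    Generic spherical projection; the resolution and field of view
    are illustrative assumptions, not values taken from the paper.
    Empty pixels are filled with +inf.
    """
    points = np.asarray(points, dtype=np.float64).reshape(-1, 3)
    depth = np.linalg.norm(points, axis=1)

    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down

    # Drop points at the sensor origin before dividing by depth.
    nonzero = depth > 1e-6
    points, depth = points[nonzero], depth[nonzero]

    yaw = np.arctan2(points[:, 1], points[:, 0])   # azimuth in [-pi, pi]
    pitch = np.arcsin(points[:, 2] / depth)        # elevation

    # Drop points outside the vertical field of view.
    in_fov = (pitch <= fov_up) & (pitch >= fov_down)
    yaw, pitch, depth = yaw[in_fov], pitch[in_fov], depth[in_fov]

    # Map angles to pixel coordinates (row 0 is the top of the view).
    cols = np.floor(0.5 * (1.0 - yaw / np.pi) * width)
    rows = np.floor((fov_up - pitch) / fov * height)
    cols = np.clip(cols, 0, width - 1).astype(np.int64)
    rows = np.clip(rows, 0, height - 1).astype(np.int64)

    # Keep the nearest return when several points share a pixel.
    range_image = np.full((height, width), np.inf)
    np.minimum.at(range_image, (rows, cols), depth)
    return range_image


def insert_instance(scene_points, instance_points, **kwargs):
    """Merge a synthetic instance into a scene in range-image space.

    Per pixel, the nearer surface wins, so the instance hides the
    background behind it and is hidden by anything in front of it.
    Returns the merged range image and a mask of instance pixels.
    """
    scene_ri = point_cloud_to_range_image(scene_points, **kwargs)
    inst_ri = point_cloud_to_range_image(instance_points, **kwargs)
    merged = np.minimum(scene_ri, inst_ri)
    instance_mask = inst_ri < scene_ri
    return merged, instance_mask
```

In this sketch, a synthetic point placed in front of a scene point along the same ray replaces it in the merged image, while one placed behind it is discarded. That is one simple way to get occlusion-consistent synthetic data.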