AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors

Xiaozhen Qiao, Wenjia Wang, Zhiyuan Zhao, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

TL;DR

AHAP achieves competitive performance on both world-space human reconstruction and camera pose estimation, while being 180$\times$ faster than optimization-based approaches.

Abstract

Reconstructing 3D humans from images captured from multiple perspectives typically requires pre-calibration, for example with checkerboards or MVS algorithms, which limits scalability and applicability in diverse real-world scenarios. In this work, we present \textbf{AHAP} (Reconstructing \textbf{A}rbitrary \textbf{H}umans from \textbf{A}rbitrary \textbf{P}erspectives), a feed-forward framework for reconstructing arbitrary humans from arbitrary camera perspectives without requiring camera calibration. The core of our approach is the effective fusion of multi-view geometry to assist human association, reconstruction, and localization. Specifically, we introduce a Cross-View Identity Association module that uses learnable person queries and soft assignment, supervised by contrastive learning, to match the same person across views. A Human Head fuses cross-view features and scene context for SMPL prediction, guided by cross-view reprojection losses that enforce body pose consistency. Additionally, multi-view geometry eliminates the depth ambiguity inherent in monocular methods, providing more precise 3D human localization through multi-view triangulation. Experiments on EgoHumans and EgoExo4D demonstrate that AHAP achieves competitive performance on both world-space human reconstruction and camera pose estimation, while being 180$\times$ faster than optimization-based approaches.
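To make the idea of soft assignment against learnable person queries concrete, here is a minimal sketch. It is not the authors' module: the function name `soft_assign`, the use of cosine similarity, and the `temperature` value are all assumptions made for illustration, and the contrastive training of the queries is omitted entirely.

```python
import numpy as np

def soft_assign(person_feats, queries, temperature=0.1):
    """Softly assign per-view person features to identity queries.

    person_feats: (N, d) array, one feature per detected person (all views pooled).
    queries:      (K, d) array standing in for learned person queries.
    Returns an (N, K) matrix whose rows are probability distributions.
    NOTE: cosine similarity + softmax is an assumed design, not taken from the paper.
    """
    f = np.asarray(person_feats, dtype=float)
    q = np.asarray(queries, dtype=float)
    # L2-normalize so the dot product is a cosine similarity.
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    logits = f @ q.T / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

Under these assumptions, detections from different views whose rows peak at the same query would be treated as the same person; a hard identity can be read off with `argmax` over each row.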

Paper Structure

This paper contains 35 sections, 20 equations, 9 figures, 9 tables, 2 algorithms.

Figures (9)

  • Figure 1: (a) AHAP achieves 180$\times$ speedup over optimization-based HSfM [muller2025reconstructing] while maintaining competitive accuracy. (b) Results on EgoHumans. (c) Results on EgoExo4D.
  • Figure 2: Overall pipeline of AHAP. Given multi-view images, the scene encoder [lin2025depth] estimates scene geometry and camera poses, while the human encoder [baradel2024multi] extracts human-centric features. Our cross-view identity association module matches the same person across views via learnable queries. The human head fuses scene tokens, aggregated tokens, and reference view tokens through a multi-view fusion decoder to predict SMPL parameters. Finally, we align humans and scene point clouds via scale alignment and multi-view triangulation for precise human localization.
  • Figure 3: PCA visualization of feature distributions. (a-d) Scene encoder [lin2025depth] features; (e-h) Human encoder [baradel2024multi] features. Human encoder features show stronger semantic clustering aligned with ground truth, while the scene encoder provides complementary geometric information.
  • Figure 4: Multi-view triangulation for human position refinement. For persons visible in multiple views, we refine their 3D positions using DLT triangulation based on 2D pelvis observations and estimated camera poses, improving human-scene alignment.
  • Figure 5: Qualitative results. Visualization of human-scene reconstruction on EgoHumans and EgoExo4D. AHAP produces accurate human meshes within reconstructed scenes, maintaining consistent identity association across views.
  • ...and 4 more figures
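
The DLT triangulation mentioned in the Figure 4 caption can be sketched as follows. This is the textbook linear method, shown under stated assumptions rather than as the paper's implementation: the function name `triangulate_dlt` is hypothetical, and each camera is assumed to be given as a 3×4 projection matrix P = K [R | t] with a 2D pelvis observation in pixel coordinates.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from two or more views with linear DLT.

    projections: iterable of 3x4 camera projection matrices (assumed P = K [R | t]).
    points_2d:   iterable of (u, v) pixel observations, one per view.
    Returns the 3D point as a length-3 array.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the homogeneous point X:
        # u * (P[2] . X) - (P[0] . X) = 0 and v * (P[2] . X) - (P[1] . X) = 0.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The least-squares solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With noise-free observations from two cameras that have distinct centers, this recovers the original point up to numerical precision; with noisy detections it gives an algebraic least-squares estimate that is commonly refined by minimizing reprojection error.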