UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair

Chuanrui Zhang, Yingshuang Zou, ZhengXian Wu, Yonggen Ling, Yuxiao Yang, Ziwei Wang

Abstract

Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve scale ambiguity. We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks. Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area. Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserving true physical proportions across diverse object types, highlighting its potential for practical robotic applications.
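To make the scale-ambiguity point concrete, the sketch below uses the standard rectified-stereo relation Z = f · B / d, which ties pixel disparity to metric depth once the baseline is known. It is a textbook illustration, not UniPR's learned mechanism; the function names and the camera values (focal length, baseline, disparity) are assumptions chosen for the example.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Metric depth of a point seen in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


def metric_extent(focal_px: float, depth_m: float, extent_px: float) -> float:
    """Back-project an image-space extent (pixels) to metres at a known depth."""
    return extent_px * depth_m / focal_px


# Illustrative (assumed) camera: 640 px focal length, 12.5 cm baseline.
depth = depth_from_disparity(focal_px=640.0, baseline_m=0.125, disparity_px=40.0)
width = metric_extent(focal_px=640.0, depth_m=depth, extent_px=64.0)
print(f"depth = {depth:.3f} m, object width = {width:.3f} m")
```

With a single image, the same 64-pixel extent is consistent with a small nearby object or a large distant one; the known baseline is what pins down the metric size.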

Paper Structure

This paper contains 29 sections, 11 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: UniPR enables efficient full-scene processing and produces real-scale, physically consistent object shapes by leveraging geometric constraints, outperforming traditional image-to-3D reconstruction models. As the first fully end-to-end method, UniPR achieves up to 100× faster generation and delivers a 3× improvement in shape-proportion accuracy compared with SOTA methods.
  • Figure 2: Comparison between our end-to-end approach and the classical pipeline. Our method enables information to flow seamlessly across all components, allowing the network to leverage the full-image context for shape reconstruction. This end-to-end design effectively handles occlusion and significantly improves the preservation of true shape proportions compared to classical, modular pipelines.
  • Figure 3: Overview of Our Proposed UniPR. We present UniPR, a single-forward network capable of simultaneously processing multiple unknown objects. Taking stereo image pairs as input, UniPR first encodes the scene into Tri-Plane View features that comprehensively capture spatial and geometric information. Within the transformer decoder, object queries are employed to extract instance-specific features from these TPV embeddings, enabling the network to reason about multiple objects in parallel. The resulting object embeddings are then fed into specialized prediction heads to infer each object’s semantic label, 3D position, physical scale, and pose-aware shape representation.
  • Figure 4: Qualitative shape reconstruction results compared with image-to-3D models. The results demonstrate the accurate preservation of shape proportions achieved by our proposed UniPR across various objects in the LVS6D dataset.
  • Figure 5: Qualitative pose-aware shape reconstruction results on LVS6D dataset. The results highlight the key role of PASR in simplifying rotation prediction, as it eliminates the ambiguity caused by different canonical definitions for categories with similar geometry.
  • ...and 9 more figures
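The Figure 3 caption describes object queries extracting instance-specific features from the TPV embeddings inside a transformer decoder. The sketch below illustrates only that general idea, as single-head cross-attention with no learned projections, in plain Python. It is not the paper's architecture: `cross_attend`, the token layout, and the toy numbers are all hypothetical, and the prediction heads (semantic label, 3D position, physical scale, pose-aware shape) are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, feature_tokens):
    """Return one embedding per query by attending over all feature tokens.

    queries:        list of query vectors, one per candidate object (hypothetical)
    feature_tokens: list of scene feature vectors, standing in for flattened
                    TPV features (hypothetical layout)
    """
    dim = len(feature_tokens[0])
    embeddings = []
    for query in queries:
        # Scaled dot-product similarity between this query and every token.
        scores = [
            sum(q * f for q, f in zip(query, token)) / math.sqrt(dim)
            for token in feature_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the tokens is the object's embedding.
        embeddings.append(
            [sum(w * token[i] for w, token in zip(weights, feature_tokens)) for i in range(dim)]
        )
    return embeddings


# Toy scene: three feature tokens, two object queries, feature dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_queries = [[4.0, 0.0], [0.0, 4.0]]
for index, embedding in enumerate(cross_attend(object_queries, tokens)):
    print(index, [round(value, 3) for value in embedding])
```

Each query is processed independently of the others, so all objects can be decoded in one pass; in a design like the one the caption describes, each resulting embedding would then be passed to the per-object prediction heads.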