Real3D: Scaling Up Large Reconstruction Models with Real-World Images

Hanwen Jiang, Qixing Huang, Georgios Pavlakos

TL;DR

Real3D tackles the data bottleneck in single-view 3D reconstruction by training large reconstruction models on real-world single-view images. It introduces a self-training framework with pixel-level cycle-consistency and CLIP-based semantic guidance, paired with automatic data curation to select unoccluded instances. The approach blends synthetic multi-view supervision with real-image self-training and demonstrates consistent improvements across diverse real and synthetic benchmarks. The work highlights scalability and generalization potential for 3D foundation models in AR/VR and AIGC applications.

Abstract

The default strategy for training single-view Large Reconstruction Models (LRMs) follows the fully supervised route using large-scale datasets of synthetic 3D assets or multi-view captures. Although these resources simplify the training procedure, they are hard to scale up beyond the existing datasets and they are not necessarily representative of the real distribution of object shapes. To address these limitations, in this paper, we introduce Real3D, the first LRM system that can be trained using single-view real-world images. Real3D introduces a novel self-training framework that can benefit from both the existing synthetic data and diverse single-view real images. We propose two unsupervised losses that allow us to supervise LRMs at the pixel- and semantic-level, even for training examples without ground-truth 3D or novel views. To further improve performance and scale up the image data, we develop an automatic data curation approach to collect high-quality examples from in-the-wild images. Our experiments show that Real3D consistently outperforms prior work in four diverse evaluation settings that include real and synthetic data, as well as both in-domain and out-of-domain shapes. Code and model can be found here: https://hwjiang1510.github.io/Real3D/

Paper Structure (18 sections, 10 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Single-view 3D Reconstruction with Real3D. We compare Real3D with the state-of-the-art TripoSR model (Tochilkin et al., 2024). Unlike TripoSR, which is trained solely on synthetic data, Real3D uses single-view real-world images. We provide reconstructions from two novel views. Please see more results on our website.
  • Figure 2: Real3D overview. (Top) Real3D is trained jointly on synthetic data (fully supervised) and on single-view real images using unsupervised losses. A curation strategy is used to identify and leverage the high-quality training instances from the initial image collection. (Bottom) We adopt the LRM model architecture.
  • Figure 3: Pixel-level Guidance using cycle-consistency. (Left) We show the forward and backward path of the cycle. (Right) Details of the pose sampling strategy with the curriculum.
  • Figure 4: Real3D reconstruction of in-the-wild instances. We show the input view and two novel views.
  • Figure 5: Real3D performance (PSNR) using different amounts of real data for training. The PSNR is evaluated on novel views for (a)-(c) and with self-consistency for (d).
  • ...and 3 more figures