ZeroShape: Regression-based Zero-shot Shape Reconstruction

Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James M. Rehg

TL;DR

ZeroShape addresses single-image zero-shot 3D shape reconstruction by directly regressing a view-centric occupancy field. The model has three parts: a depth and camera-intrinsics estimator, a differentiable unprojection step that converts those estimates into a projection map, and a projection-guided cross-attention reconstructor. Training uses a two-stage loss that ends with 3D occupancy supervision. The training assets come from ShapeNet and Objaverse, and a large evaluation benchmark built from three real-world datasets (OmniObject3D, Ocrtoc3D, and Pix3D) measures zero-shot generalization. On this benchmark ZeroShape outperforms state-of-the-art methods while using significantly less training data and compute than prior generative methods. The results suggest that efficient regression-based models remain competitive for zero-shot 3D reconstruction, and the benchmark gives the community a larger, more diverse evaluation resource.
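
The unprojection step mentioned above can be illustrated with a minimal NumPy sketch of standard pinhole back-projection. This is not the authors' implementation: the function names, the toy 64x64 camera, and the focal lengths are assumptions chosen for illustration, the surface normalization is omitted, and NumPy is not differentiable (an autodiff framework would be needed for training). The second call shows how a wrong focal length distorts the aspect ratio of the recovered surface even though the depth map is exact.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (H, W, 3) map of camera-space points.

    Hypothetical helper: a plain pinhole model with focal lengths (fx, fy)
    and principal point (cx, cy), all in pixels.
    """
    h, w = depth.shape
    # Pixel-center coordinates: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    # Three channels (x, y, z) per pixel, analogous to a projection map.
    return np.stack([x, y, depth], axis=-1)

def extent_ratio(points):
    """Width-to-height ratio of the axis-aligned box around the points."""
    flat = points.reshape(-1, 3)
    size = flat.max(axis=0) - flat.min(axis=0)
    return size[0] / size[1]

# A fronto-parallel plane 2 units away, seen by a toy 64x64 camera.
depth = np.full((64, 64), 2.0)
good = unproject_depth(depth, fx=80.0, fy=80.0, cx=32.0, cy=32.0)
bad = unproject_depth(depth, fx=40.0, fy=80.0, cx=32.0, cy=32.0)  # fx is wrong

print(extent_ratio(good))  # square footprint
print(extent_ratio(bad))   # stretched along x
```

With the correct intrinsics the plane keeps its square footprint; halving `fx` doubles its width relative to its height.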

Abstract

We study the problem of single-image zero-shot 3D shape reconstruction. Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets, but these models are computationally expensive at train and inference time. In contrast, the traditional approach to this problem is regression-based, where deterministic models are trained to directly regress the object shape. Such regression methods possess much higher computational efficiency than generative methods. This raises a natural question: is generative modeling necessary for high performance, or conversely, are regression-based approaches still competitive? To answer this, we design a strong regression-based model, called ZeroShape, based on the converging findings in this field and a novel insight. We also curate a large real-world evaluation benchmark, with objects from three different real-world 3D datasets. This evaluation benchmark is more diverse and an order of magnitude larger than what prior works use to quantitatively evaluate their models, aiming at reducing the evaluation variance in our field. We show that ZeroShape not only achieves superior performance over state-of-the-art methods, but also demonstrates significantly higher computational and data efficiency.

Paper Structure (22 sections, 2 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: We outperform SOTA methods for zero-shot 3D shape reconstruction, with faster inference and less training data. Circle size indicates the number of 3D assets used for training, the largest being 3M. F-Score with threshold 0.05 is averaged over Ocrtoc3D shrestha2022ocrctoc, Pix3D sun2018pix3d and OmniObject3D wu2023omniobject3d.
  • Figure 2: ZeroShape reconstructions from in-the-wild images. Our method produces detailed and accurate object reconstructions from single-view images on a diverse set of objects.
  • Figure 3: Overview of our model. Our model consists of three modules: a depth and camera estimator, a geometric unprojection unit and a projection-guided shape reconstructor. The depth and camera estimator predicts the depth and camera intrinsics from the input image with a DPT backbone. The geometric unprojection unit converts the depth and intrinsics estimates into a normalized 3D visible surface, which is parameterized by a three-channel projection map. The shape reconstructor finally reconstructs the full occupancy field by fetching localized information from the projection map through cross attention.
  • Figure 4: Effect of Intrinsics. Unprojecting an accurate depth map into a 3D surface with erroneous intrinsics leads to a skewed shape with the wrong 3D aspect ratio.
  • Figure 5: Qualitative results. We compare ZeroShape to other SOTA methods on our curated benchmark (first three columns are from Ocrtoc3D shrestha2022ocrctoc, last three are from OmniObject3D wu2023omniobject3d). Our reconstruction not only better aligns with the visible surfaces from images, but also recovers a faithful global structure of the reconstructed objects.
  • ...and 7 more figures
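
The Figure 1 caption reports F-Score at a distance threshold of 0.05. As a rough illustration, the sketch below computes the commonly used point-set F-score: the harmonic mean of precision (predicted points within the threshold of some ground-truth point) and recall (ground-truth points within the threshold of some predicted point). The paper's exact protocol (point sampling, alignment, normalization) is not given in this section, so the function name `f_score` and the toy point sets are assumptions, not the authors' evaluation code.

```python
import numpy as np

def f_score(pred, gt, threshold=0.05):
    """F-score between point sets of shape (N, 3) and (M, 3).

    Hypothetical helper using brute-force distances; suitable for small sets only.
    """
    # Pairwise Euclidean distances, shape (N, M).
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Precision: fraction of predicted points close to the ground truth.
    precision = np.mean(dists.min(axis=1) < threshold)
    # Recall: fraction of ground-truth points close to the prediction.
    recall = np.mean(dists.min(axis=0) < threshold)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

# Toy data: four corners of a unit square, with one predicted corner misplaced.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
pred = gt.copy()
pred[3] = [1.0, 1.0, 1.0]

print(f_score(gt, gt))    # identical sets
print(f_score(pred, gt))  # one of four points is off in both directions
```

For real meshes the point sets are typically large, so a nearest-neighbor structure such as a k-d tree would replace the brute-force distance matrix.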