Floating No More: Object-Ground Reconstruction from a Single Image

Yunze Man, Yichen Sheng, Jianming Zhang, Liang-Yan Gui, Yu-Xiong Wang

TL;DR

ORG (Object Reconstruction with Ground) is a framework for reconstructing a 3D object together with its ground plane and camera parameters from a single image, so that the reconstructed object does not appear to float. It relies on two dense pixel-level representations, pixel height and perspective field, and on a Perspective Field Guided Pixel Height Reprojection module that converts these predictions into depth maps and point clouds, enabling grounded shadowing and pose-aware rendering. The model uses a Pyramid Vision Transformer encoder with a SegFormer-style decoder to predict front and back pixel heights, latitude, and up-vector fields, and is trained end-to-end with regression losses on these dense fields. Trained on a large Blender-rendered Objaverse dataset, ORG outperforms depth-estimation, image-to-3D, and camera-parameter baselines on depth and point-cloud metrics and generalizes zero-shot to unseen objects and humans, with practical benefits for shadow generation and image composition.

Abstract

Recent advancements in 3D object reconstruction from single images have primarily focused on improving the accuracy of object shapes. Yet, these techniques often fail to accurately capture the inter-relation between the object, ground, and camera. As a result, the reconstructed objects often appear floating or tilted when placed on flat surfaces. This limitation significantly affects 3D-aware image editing applications like shadow rendering and object pose manipulation. To address this issue, we introduce ORG (Object Reconstruction with Ground), a novel task aimed at reconstructing 3D object geometry in conjunction with the ground surface. Our method uses two compact pixel-level representations to depict the relationship between camera, object, and ground. Experiments show that the proposed ORG model can effectively reconstruct object-ground geometry on unseen data, significantly enhancing the quality of shadow generation and pose manipulation compared to conventional single-image 3D reconstruction techniques.

Paper Structure (18 sections, 4 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Our proposed ORG (Object Reconstruction with Ground) model simultaneously reconstructs a 3D object, estimates camera parameters, and models the object-ground relationship from a monocular image. During shadow and reflection generation, prior depth-based object geometry estimation methods can produce floating objects or unnatural shadows on the ground, as shown in the red boxes. Our method, in contrast, achieves significantly more realistic editing and generation, as shown in the blue boxes.
  • Figure 2: Without modeling the object-ground correlation, an existing single-view 3D estimation method (MiDaS, ranftl2020midas) generates 3D models that float above or tilt relative to the ground.
  • Figure 3: ORG Paradigm. Our proposed method takes a single-view object-centric image as input and jointly estimates two dense representations, the pixel height and the perspective field, which encode the object-ground relationship and the camera parameters, respectively. A Perspective Field Guided Pixel Height Reprojection module then converts the two predicted dense fields into a depth map and a point cloud.
  • Figure 4: Perspective-Guided Pixel Height Reprojection. PField and PixHt are perspective field and pixel height, respectively.
  • Figure 5: Qualitative results of shadow and reflection generation on the ground, as well as object-ground reconstruction and depth estimation. We compare with the depth-based estimation method LeReS (yin2021leres) and the monocular novel view synthesis method Zero-123 (liu2023zero). ORG preserves the object-ground relationship better than prior methods, which leads to much more realistic shadow and reflection generation, as shown in the blue boxes. Our method runs fast and can directly output representations such as depth maps and point clouds.
  • ...and 4 more figures
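
The captions for Figures 3 and 4 describe a pipeline that ends in a depth map and a point cloud. As a rough illustration of that last step only, the sketch below back-projects a depth map into 3D points using the standard pinhole camera model. It is not the paper's reprojection module: the function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical, and the intrinsics are assumed to be given here, whereas ORG would derive camera parameters from its predicted perspective field.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    mask: Optional[Sequence[Sequence[bool]]] = None,
) -> List[Point]:
    """Back-project a depth map to camera-space points (pinhole model).

    Assumptions (not from the paper): depth[v][u] is the z-distance along
    the optical axis, (fx, fy) are focal lengths in pixels, and (cx, cy)
    is the principal point. Pixels with non-positive depth, or outside
    the optional object mask, are skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if mask is not None and not mask[v][u]:
                continue  # background pixel
            if z <= 0:
                continue  # invalid or missing depth
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel.
depth_map = [[2.0, 2.0], [0.0, 4.0]]
cloud = backproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In this toy example `cloud` holds three points, because the pixel with zero depth is dropped. A real pipeline would typically vectorize this over arrays and apply the object mask so that only foreground pixels are lifted to 3D.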