Floating No More: Object-Ground Reconstruction from a Single Image
Yunze Man, Yichen Sheng, Jianming Zhang, Liang-Yan Gui, Yu-Xiong Wang
TL;DR
This paper introduces Object Reconstruction with Ground (ORG), a framework for reconstructing a 3D object together with its ground plane and camera parameters from a single image, so that the reconstructed object does not appear to float. It relies on two dense pixel-level representations, pixel height and perspective field, and on a Perspective Field Guided Pixel Height Reprojection step that converts these predictions into depth maps and point clouds, enabling grounded shadow generation and pose-aware rendering. The model uses a Pyramid Vision Transformer encoder with a SegFormer-style decoder to predict front and back pixel heights, latitude, and up-vector fields, and is trained end-to-end by regressing these dense fields. Trained on a large Blender-rendered Objaverse dataset, ORG outperforms depth-estimation, image-to-3D, and camera-parameter baselines on depth and point-cloud metrics and shows robust zero-shot generalization to unseen objects and humans, with practical benefits for shadow generation and image composition.
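To make the idea of turning a pixel height into depth more concrete, here is a minimal geometric sketch. It is not the paper's Perspective Field Guided Pixel Height Reprojection. It assumes a pinhole camera with zero roll and known focal length `f`, principal point `(cx, cy)`, downward pitch `pitch`, and camera height `cam_height` above a flat ground. It also assumes that pixel height is the row offset between a pixel and its ground contact point, and that the object point lies at the same forward distance as that contact point. The function names and parameters are hypothetical.

```python
import math


def reproject_pixel_height(u, v, pixel_height, f, cx, cy, pitch, cam_height):
    """Lift one pixel to 3D from its pixel height (illustrative sketch only).

    Returns (X, Y, Z, height): the point in camera coordinates
    (x right, y down, z forward) and its metric height above the ground.
    """
    c, s = math.cos(pitch), math.sin(pitch)
    # Normalized image coordinates of the pixel and of its ground contact row.
    x_p = (u - cx) / f
    y_p = (v - cy) / f
    y_f = (v + pixel_height - cy) / f

    # Intersect the contact-point ray with the ground plane.
    denom = c * y_f + s
    if denom <= 0:
        raise ValueError("ground contact pixel is at or above the horizon")
    t = cam_height / denom
    # Horizontal forward distance from the camera to the contact point.
    ground_dist = t * (c - s * y_f)

    # Scale the object ray so it reaches the same forward distance.
    forward = c - s * y_p
    if forward <= 0:
        raise ValueError("pixel ray does not point forward")
    z = ground_dist / forward
    height = cam_height - z * (c * y_p + s)
    return (x_p * z, y_p * z, z, height)


def project_point(lateral, forward, height, f, cx, cy, pitch, cam_height):
    """Project a point given by lateral offset, forward distance, and height
    above the ground into pixel coordinates (helper for a round-trip check)."""
    c, s = math.cos(pitch), math.sin(pitch)
    drop = cam_height - height  # how far the point sits below the camera
    y_cam = -forward * s + drop * c
    z_cam = forward * c + drop * s
    return (cx + f * lateral / z_cam, cy + f * y_cam / z_cam)
```

A round trip is a quick way to check the geometry: project a point and its ground contact with `project_point`, pass the row difference as `pixel_height`, and `reproject_pixel_height` should return the original height and depth. In this sketch the camera values are simply given as inputs, whereas ORG predicts a perspective field and estimates the camera parameters as part of the reconstruction.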
Abstract
Recent advancements in 3D object reconstruction from single images have primarily focused on improving the accuracy of object shapes. Yet, these techniques often fail to accurately capture the inter-relation between the object, ground, and camera. As a result, the reconstructed objects often appear floating or tilted when placed on flat surfaces. This limitation significantly affects 3D-aware image editing applications like shadow rendering and object pose manipulation. To address this issue, we introduce ORG (Object Reconstruction with Ground), a novel task aimed at reconstructing 3D object geometry in conjunction with the ground surface. Our method uses two compact pixel-level representations to depict the relationship between camera, object, and ground. Experiments show that the proposed ORG model can effectively reconstruct object-ground geometry on unseen data, significantly enhancing the quality of shadow generation and pose manipulation compared to conventional single-image 3D reconstruction techniques.
