DM-OSVP++: One-Shot View Planning Using 3D Diffusion Models for Active RGB-Based Object Reconstruction

Sicong Pan, Liren Jin, Xuying Huang, Cyrill Stachniss, Marija Popović, Maren Bennewitz

TL;DR

DM-OSVP++ addresses RGB-based active object reconstruction by introducing one-shot view planning guided by priors from a 3D diffusion model. It uses EscherNet to generate a proxy mesh from a few reference views, then casts the planning problem as a customized set-covering optimization that accounts for geometric and textural complexity through PFHRGB-based entropy and multi-view constraints. The approach yields a globally shortest viewing path over an object-centric view space, enabling efficient data collection, and demonstrates compatibility with multiple RGB-based reconstruction backends (e.g., Instant-NGP, NeuS2, 2DGS). Real-world experiments with a UR5 robot show dynamic view-space adaptation and robust reconstruction under practical constraints, highlighting the method’s applicability to diverse objects and environments. Overall, DM-OSVP++ achieves a favorable trade-off between viewpoint efficiency and reconstruction quality by leveraging diffusion priors and a principled, object-specific planning framework.

Abstract

Active object reconstruction is crucial for many robotic applications. A key aspect in these scenarios is generating object-specific view configurations to obtain informative measurements for reconstruction. One-shot view planning enables efficient data collection by predicting all views at once, eliminating the need for time-consuming online replanning. Our primary insight is to leverage the generative power of 3D diffusion models as valuable prior information. By conditioning on initial multi-view images, we exploit the priors from the 3D diffusion model to generate an approximate object model, serving as the foundation for our view planning. Our novel approach integrates the geometric and textural distributions of the object model into the view planning process, generating views that focus on the complex parts of the object to be reconstructed. We validate the proposed active object reconstruction system through both simulation and real-world experiments, demonstrating the effectiveness of using 3D diffusion priors for one-shot view planning.

Paper Structure

This paper contains 35 sections, 4 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: An example of our one-shot view planning in a real-world scenario. Our goal is to actively plan a set of views (blue) at once to collect informative RGB images for object reconstruction using the robotic arm. The key component in our approach is a multi-image-to-3D diffusion model generating the corresponding 3D mesh by conditioning on several RGB images captured from the initial reference views (red). By leveraging the mesh as both geometric and textural priors, our approach outputs view configurations specifically associated with the target object and calculates the globally shortest path. In particular, our method plans denser views to observe more complex parts (the upper part of the object in the example) to improve the reconstruction quality.
  • Figure 2: Illustration of the impact of multi-view constraints. $\alpha$ denotes the minimum number of views required to observe each surface point. Larger $\alpha$ values lead to optimization solutions with more views densely covering the surfaces.
  • Figure 3: Illustration of the geometric and textural complexity of two representative objects: one with geometric complexity and simple texture (top), and the other with simple geometry and complex texture (bottom). Regions with greater complexity are characterized by dramatic changes, such as significant curvature variations or pronounced color transitions.
  • Figure 4: Illustration of the impact of distance constraints: (a) spatially clustered views (the orange circle showcases an example of clustered views); (b) spatially more uniform views. Both view configurations are feasible solutions. By incorporating distance constraints, we express the preference for spatially uniform distribution to avoid redundant information in clustered view configurations.
  • Figure 5: An overview of our proposed active RGB-based 3D reconstruction system utilizing one-shot view planning with 3D diffusion models. Starting with two RGB images capturing an overview of the tabletop, the object to be reconstructed is roughly localized using the Voxel Carving method [laurentini1994tpami]. Given the localized carved mesh, we generate an object-centric view space and evaluate the reachability of each viewpoint with MoveIt [chitta2012moveit], where green indicates reachable viewpoints, while black points represent non-reachable ones. Three reference view images (top, left, front) are collected as the input of the 3D diffusion model EscherNet [kong2024cvpr] to generate a 3D mesh, which serves as a proxy to the ground truth 3D model and is the basis for our one-shot view planner. Based on this prior, we construct the one-shot view planning task as a customized set covering optimization problem and solve it to obtain a minimum set of views required to densely cover the mesh surfaces. The RGB camera follows the generated globally shortest path (purple) to collect posed RGB images, which we use to obtain the 3D representation of the object after the data acquisition is completed.
  • ...and 9 more figures
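
The multi-view constraint described in the Figure 2 caption can be illustrated with a small set multi-cover sketch: choose views until every surface point is observed by at least $\alpha$ of them. The code below is only a toy illustration under stated assumptions. It uses a greedy heuristic rather than the customized set-covering optimization the paper solves, and the visibility table `VISIBILITY`, the view names, and the function `greedy_multicover` are hypothetical, not taken from the paper.

```python
from typing import Dict, List, Set

# Hypothetical visibility table: view name -> indices of surface points it sees.
VISIBILITY: Dict[str, Set[int]] = {
    "v0": {0, 1, 2},
    "v1": {2, 3, 4},
    "v2": {4, 5, 0},
    "v3": {1, 3, 5},
    "v4": {0, 1, 2, 3},
    "v5": {3, 4, 5},
}
NUM_POINTS = 6


def greedy_multicover(
    visibility: Dict[str, Set[int]], num_points: int, alpha: int
) -> List[str]:
    """Greedily pick views until each point is seen by at least `alpha` views.

    This is a heuristic stand-in (assumption), not an exact solver.
    """
    need = [alpha] * num_points      # remaining observations required per point
    remaining = dict(visibility)     # views not yet selected
    chosen: List[str] = []

    def gain(view: str) -> int:
        # Number of still under-observed points this view would help with.
        return sum(1 for p in remaining[view] if need[p] > 0)

    while any(n > 0 for n in need):
        # Sorting makes tie-breaking deterministic (first name wins).
        best = max(sorted(remaining), key=gain) if remaining else None
        if best is None or gain(best) == 0:
            raise ValueError("infeasible: some point cannot reach alpha views")
        for p in remaining.pop(best):
            if need[p] > 0:
                need[p] -= 1
        chosen.append(best)
    return chosen


views_a1 = greedy_multicover(VISIBILITY, NUM_POINTS, alpha=1)
views_a2 = greedy_multicover(VISIBILITY, NUM_POINTS, alpha=2)
print(len(views_a1), len(views_a2))
```

On this toy table, raising `alpha` forces the selection to include more views, which mirrors the qualitative behavior the caption describes. An exact formulation would instead minimize the number of selected views subject to the same per-point coverage requirement.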