Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

Cheng-You Lu, Zhuoli Zhuang, Nguyen Thanh Trung Le, Da Xiao, Yu-Cheng Chang, Thomas Do, Srinath Sridhar, Chin-teng Lin

TL;DR

Hestia presents a voxel-face-aware, hierarchical next-best-view (NBV) planner that treats voxels as cubes to better capture geometry during five-degree-of-freedom (5-DoF) viewpoint prediction. By using a two-stage network (look-at point first, then camera position) and a close-greedy reinforcement learning objective, it achieves real-time inference (≈25 FPS) with substantial gains in coverage and reconstruction accuracy across diverse object categories. The approach is trained on a large, diverse Objaverse-derived dataset and validated on OmniObject3D, Objaverse, and Houses3K, showing robust performance under object translation and limited view budgets, and it is demonstrated in real-world drone experiments. These results indicate strong practical potential for efficient, automated 3D reconstruction in object-centric scenes, with future work pointing toward multi-agent extensions and outdoor deployments.

Abstract

Advances in 3D reconstruction and novel view synthesis have enabled efficient and photorealistic rendering. However, image acquisition for reconstruction is still either largely manual or constrained by simple preplanned trajectories. To address this issue, recent works propose generalizable next-best-view planners that do not require online learning. Nevertheless, their robustness and performance remain limited across various shapes. Hence, this study introduces Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction (Hestia), which addresses the shortcomings of reinforcement learning-based generalizable approaches for five-degree-of-freedom viewpoint prediction. Hestia systematically improves such planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. Experimental results show that Hestia achieves non-marginal improvements, with at least a 4% gain in coverage ratio, while reducing Chamfer Distance by 50% and maintaining real-time inference. In addition, Hestia outperforms prior methods by at least 12% in coverage ratio with a 5-image budget and remains robust to object placement variations. Finally, we demonstrate that Hestia, as a next-best-view planner, is feasible for real-world applications. Our project page is https://johnnylu305.github.io/hestia_web.
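The abstract reports results as coverage ratio and Chamfer Distance. The following is a minimal sketch of how such metrics are commonly computed on point clouds, not the paper's evaluation code. The function names, the distance threshold, and the choice of an unsquared, averaged, symmetric Chamfer Distance are all assumptions, since this summary does not state the exact definitions.

```python
import math

def nearest_dist(p, cloud):
    """Distance from point p to its nearest neighbor in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance, summed over both directions.

    Assumption: unsquared distances averaged per cloud; other conventions exist.
    """
    a_to_b = sum(nearest_dist(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_dist(q, a) for q in b) / len(b)
    return a_to_b + b_to_a

def coverage_ratio(reconstructed, ground_truth, threshold):
    """Fraction of ground-truth points that lie within `threshold` of the reconstruction.

    Assumption: `threshold` is a hypothetical parameter chosen for illustration.
    """
    covered = sum(1 for g in ground_truth if nearest_dist(g, reconstructed) <= threshold)
    return covered / len(ground_truth)

# Toy example: the reconstruction captures two of four ground-truth points.
ground_truth = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
reconstructed = [(0, 0, 0), (1, 0, 0)]
print(coverage_ratio(reconstructed, ground_truth, threshold=0.1))
print(chamfer_distance(reconstructed, ground_truth))
```

A higher coverage ratio and a lower Chamfer Distance both indicate a reconstruction closer to the ground truth, which matches the direction of the improvements quoted above.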

Paper Structure

This paper contains 31 sections, 21 equations, 20 figures, 7 tables, 1 algorithm.

Figures (20)

  • Figure 1: A voxel is worth more than a ray. Unlike the RL-based generalizable method GenNBV (Chen et al., 2024), Hestia treats each voxel as a cube with six faces rather than as a point. This reduces the information loss inherent in point approximations, ensuring a more accurate representation of the voxel.
  • Figure 2: Hierarchical structure of Hestia. Hestia first predicts the camera's look-at point $L_t$ using a proposal neural network that takes as input the grid information $G_t$, which is processed from the depth image $D_t$ and the camera pose. Next, Hestia employs a grid encoder to encode the grid information $G_t$ and performs trilinear interpolation to extract the corresponding features from the encoded grid at different layers based on $L_t$. These multilevel interpolated features are then concatenated with the vector information $M_t$, which includes the camera pose $X_t$ and the maximum flyable height $H_t$, as well as with the encoded image features. The image features are extracted using an image encoder, which takes the grayscale image $I_t$ as input. Finally, this combined feature representation is fed into the RL policy model to predict the camera's position $a_t$. Note that Hestia adopts $a'_t$, the nearest collision-free point to $a_t$, as the final camera position to ensure a collision-free viewpoint. Hence, the next-best viewpoint $\{a'_t, L_t\}$ is used for data collection.
  • Figure 3: Point cloud reconstruction on three datasets. Hestia’s reconstructions are visibly better than those of prior approaches.
  • Figure 4: Point cloud reconstruction on three datasets. Hestia’s reconstructions are visibly better than those of prior approaches.
  • Figure 5: Real-world demonstration of non-shifted and shifted scenes. Red boxes indicate manually initialized viewpoints, while blue boxes denote viewpoints predicted by Hestia. The results demonstrate Hestia’s feasibility in real-world environments.
  • ...and 15 more figures
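
Two steps in the Figure 2 caption can be illustrated in isolation: trilinear interpolation of grid features at the look-at point $L_t$, and snapping the predicted position $a_t$ to the nearest collision-free point $a'_t$. The sketch below is an assumption-laden illustration, not Hestia's implementation. It uses a scalar grid instead of multi-channel encoded features, works in voxel-index coordinates with voxel centers at integer positions, and finds the nearest free voxel by brute force. The names `trilinear_sample` and `nearest_free_point` are hypothetical.

```python
import math

def trilinear_sample(grid, point):
    """Trilinearly interpolate a scalar grid[x][y][z] at a continuous point.

    Assumes every grid dimension has at least 2 cells; points outside the
    grid are clamped to its boundary.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for coord, n in zip(point, dims):
        c = min(max(coord, 0.0), n - 1.0)   # clamp into the grid
        i0 = min(int(c), n - 2)             # lower corner of the enclosing cell
        base.append(i0)
        frac.append(c - i0)
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Weight of each of the 8 corners of the enclosing cell.
                w = ((frac[0] if dx else 1 - frac[0])
                     * (frac[1] if dy else 1 - frac[1])
                     * (frac[2] if dz else 1 - frac[2]))
                value += w * grid[base[0] + dx][base[1] + dy][base[2] + dz]
    return value

def nearest_free_point(position, occupied, grid_shape):
    """Return the index of the unoccupied voxel whose center is closest to position."""
    best, best_dist = None, math.inf
    for i in range(grid_shape[0]):
        for j in range(grid_shape[1]):
            for k in range(grid_shape[2]):
                if (i, j, k) in occupied:
                    continue
                d = math.dist(position, (i, j, k))
                if d < best_dist:
                    best, best_dist = (i, j, k), d
    return best

# Toy usage: sample a feature at a look-at point, then snap a colliding position.
grid = [[[x + 2 * y + 3 * z for z in range(2)] for y in range(2)] for x in range(2)]
look_at = (0.5, 0.5, 0.5)
print(trilinear_sample(grid, look_at))
print(nearest_free_point((1.0, 1.0, 1.2), occupied={(1, 1, 1)}, grid_shape=(3, 3, 3)))
```

In the described pipeline, the interpolation would be repeated at several encoder layers and the results concatenated with $M_t$ and the image features before the policy predicts $a_t$; the sketch covers only a single level.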