Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks

Christian Simon; Sen He; Juan-Manuel Perez-Rua; Mengmeng Xu; Amine Benhalloum; Tao Xiang

TL;DR

Hyper-VolTran is a neural rendering technique for single-image 3D reconstruction. It represents surfaces with a signed distance function (SDF), incorporates generalizable priors through geometry-encoding volumes and HyperNetworks, and builds its neural encoding volumes from generated multi-view inputs.

Abstract

Solving image-to-3D from a single view is an ill-posed problem, and current neural reconstruction methods addressing it through diffusion models still rely on scene-specific optimization, constraining their generalization capability. To overcome the limitations of existing approaches regarding generalization and consistency, we introduce a novel neural rendering technique. Our approach employs the signed distance function (SDF) as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks. Specifically, our method builds neural encoding volumes from generated multi-view inputs. We adjust the weights of the SDF network conditioned on an input image at test time to allow model adaptation to novel scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts derived from the synthesized views, we propose the use of a volume transformer module to improve the aggregation of image features instead of processing each viewpoint separately. Through our proposed method, dubbed Hyper-VolTran, we avoid the bottleneck of scene-specific optimization and maintain consistency across the images generated from multiple viewpoints. Our experiments show that the proposed approach yields consistent results and rapid generation.
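
The idea of producing SDF-network weights in a single feed-forward pass, rather than by per-scene optimization, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's architecture: the class name `HyperSDF`, the single linear hypernetwork layer, the tiny 3 -> hidden -> 1 MLP, and the hand-written image embeddings are all hypothetical. The actual method additionally conditions on geometry-encoding volumes and aggregates image features with a transformer, which this sketch omits.

```python
import math
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class HyperSDF:
    """Toy hypernetwork: maps an image embedding to the weights of a tiny SDF MLP.

    Hypothetical illustration only; layer sizes and structure are assumptions.
    """

    def __init__(self, embed_dim, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Parameter count of the target SDF MLP (3 -> hidden -> 1).
        self.n_params = (3 * hidden_dim + hidden_dim) + (hidden_dim + 1)
        # The hypernetwork here is one linear layer (assumed for brevity).
        scale = 1.0 / math.sqrt(embed_dim)
        self.H = [[rng.gauss(0.0, scale) for _ in range(embed_dim)]
                  for _ in range(self.n_params)]
        self.h_bias = [0.0] * self.n_params

    def generate_weights(self, image_embedding):
        """One forward pass: embedding -> flat vector -> SDF MLP parameters."""
        flat = linear(self.H, self.h_bias, image_embedding)
        h = self.hidden_dim
        i = 0
        w1 = [flat[i + 3 * r: i + 3 * r + 3] for r in range(h)]
        i += 3 * h
        b1 = flat[i:i + h]
        i += h
        w2 = [flat[i:i + h]]
        i += h
        b2 = flat[i:i + 1]
        return w1, b1, w2, b2


def sdf(params, point):
    """Evaluate the generated SDF MLP at a 3D point (ReLU hidden layer)."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, v) for v in linear(w1, b1, point)]
    return linear(w2, b2, hidden)[0]


# Two different (made-up) image embeddings give two different SDF networks,
# with no gradient steps taken at test time.
hyper = HyperSDF(embed_dim=4)
params_a = hyper.generate_weights([1.0, 0.0, 0.5, -0.2])
params_b = hyper.generate_weights([0.0, 1.0, -0.3, 0.8])
query = [0.1, 0.2, 0.3]
dist_a = sdf(params_a, query)
dist_b = sdf(params_b, query)
```

The point of the sketch is the control flow: adapting to a new input costs one call to `generate_weights`, after which the SDF can be queried at arbitrary 3D points. In a trained system the hypernetwork parameters would be learned across many objects; here they are random, so the returned distances carry no geometric meaning.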

Paper Structure (28 sections, 10 equations, 7 figures, 2 tables)

This paper contains 28 sections, 10 equations, 7 figures, 2 tables.

Introduction
Related Work
Diffusion models for 2D to 3D reconstruction.
Generalizable priors for fast 3D reconstruction.
Context-based learning.
Proposed Method
One to multiple-view images
Synthesized views.
Geometry-Aware Encoding
Neural encoding volume.
Volume Rendering
Signed Distance Function (SDF).
HyperNetworks for an SDF network.
Rendering from SDFs.
VolTran: multi-view aggregation transformer.
...and 13 more sections

Figures (7)

Figure 1: Top: Comparison of our proposed method against baselines on the running time and Chamfer Distance with the bubble area indicating IoU. Bottom: Our pipeline comprises two components for image-to-3D by synthesizing multi-views from a diffusion model and mapping from multi-views to SDFs using an SDF network with weights generated from a HyperNetwork.
Figure 2: Our training pipeline starts from a single image. Expanding a single view to an image set using a viewpoint-aware generation model, our method employs supervised learning with RGB and depth regression losses. Specifically, 1) Utilizing $N$ RGB images and depth maps, we generate additional viewpoints and camera poses. 2) Geometry-Guided Encoding is derived from warped image features in the form of a Cost Volume. 3) Instead of test-time optimization, we obtain SDF weights with a single pass of a HyperNetwork module, considering image appearance through visual encoding. 4) The geometry-encoded volume and the image features are passed to the SDF network and a transformer module to reveal the complete 3D object structure. Hence, our method Hyper-VolTran encompasses quick adaptation to novel inputs thanks to our HyperNetwork design and consistent structures from global attention.
Figure 3: Qualitative results of Hyper-VolTran on text-to-3D colored meshes. The generated images from a diffusion model are used as inputs. We only focus on the main object of the input image.
Figure 4: Qualitative comparison on single image to 3D reconstruction with previous works, e.g., One2345 [liu2023one2345], Shap-e [jun2023shape], Point-e [nichol2022pointe], and Zero123+SD [poole2022dreamfusion]. VolTran offers more consistent and higher-quality results than competitors, generally providing a higher level of preservation of input details. Please see our supplementary material for more results and zoomed-in details.
Figure 5: Examples of inconsistently generated views and comparison of our proposed method against One2345 [liu2023one2345] in generating meshes. One2345 fails to build well-reconstructed meshes when the views are arguably inconsistent and challenging.
...and 2 more figures
