Variable Radiance Field for Real-World Category-Specific Reconstruction from Single Image

Kun Wang, Zhiqiang Yan, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang

TL;DR

This work tackles single-image category-specific 3D reconstruction by moving away from projection-based local feature retrieval that depends on known camera intrinsics. The proposed Variable Radiance Field (VRF) uses global latent encodings for geometry and appearance, a learnable category-specific shape template to align instances into a canonical space, and a hyper-network-driven, compact NeRF for fast, instance-specific rendering. Key contributions include the Object Encoding Module (OEM) for intrinsic-free feature extraction, the Dynamic Ray Sampling Module (DRSM) for canonical alignment, and the Instance Creation Module (ICM) for efficient NeRF generation, complemented by contrastive pretraining. Evaluations on the CO3D dataset show VRF achieves state-of-the-art reconstruction quality and faster rendering compared to existing methods, with robustness to occlusions and viewpoint variations, enabling practical real-world use.
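
The hyper-network idea mentioned above can be illustrated with a minimal sketch: a global latent code is mapped to the weights of a small MLP, which is then evaluated on 3D points. This is not the paper's ICM. The layer sizes, the latent dimension, the linear "heads", and every function name below are hypothetical choices made only for illustration.

```python
import numpy as np

def init_hypernet(latent_dim, layer_sizes, rng):
    """Create one linear head per target layer (hypothetical design).

    Each head maps the latent code to that layer's flattened weights and biases.
    """
    heads = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        n_params = fan_in * fan_out + fan_out
        heads.append(rng.normal(0.0, 0.01, size=(latent_dim, n_params)))
    return heads

def generate_mlp(z, heads, layer_sizes):
    """Turn a latent code z into a list of (weight, bias) pairs for a small MLP."""
    params = []
    for head, fan_in, fan_out in zip(heads, layer_sizes[:-1], layer_sizes[1:]):
        flat = z @ head
        weight = flat[: fan_in * fan_out].reshape(fan_in, fan_out)
        bias = flat[fan_in * fan_out:]
        params.append((weight, bias))
    return params

def run_mlp(x, params):
    """Evaluate the generated MLP with ReLU between layers (no output activation)."""
    h = x
    for i, (weight, bias) in enumerate(params):
        h = h @ weight + bias
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
layer_sizes = [3, 32, 32, 4]      # assumed: xyz in, (r, g, b, density) out
heads = init_hypernet(64, layer_sizes, rng)
z = rng.normal(size=64)           # stand-in for a global latent from an encoder
params = generate_mlp(z, heads, layer_sizes)
points = rng.normal(size=(5, 3))
out = run_mlp(points, params)     # one 4-vector per query point
```

Because the per-instance network is produced in one forward pass and is small, evaluating it per sample point is cheap, which is the efficiency argument the summary makes for a compact, instance-specific NeRF.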

Abstract

Reconstructing category-specific objects using Neural Radiance Field (NeRF) from a single image is a promising yet challenging task. Existing approaches predominantly rely on projection-based feature retrieval to associate 3D points in the radiance field with local image features from the reference image. However, this process is computationally expensive, dependent on known camera intrinsics, and susceptible to occlusions. To address these limitations, we propose Variable Radiance Field (VRF), a novel framework capable of efficiently reconstructing category-specific objects without requiring known camera intrinsics and demonstrating robustness against occlusions. First, we replace the local feature retrieval with global latent representations, generated through a single feed-forward pass, which improves efficiency and eliminates reliance on camera intrinsics. Second, to tackle coordinate inconsistencies inherent in real-world datasets, we define a canonical space by introducing a learnable, category-specific shape template and explicitly aligning each training object to this template using a learnable 3D transformation. This approach also reduces the complexity of geometry prediction to modeling deformations from the template to individual instances. Finally, we employ a hyper-network-based method for efficient NeRF creation and enhance the reconstruction performance through a contrastive learning-based pretraining strategy. Evaluations on the CO3D dataset demonstrate that VRF achieves state-of-the-art performance in both reconstruction quality and computational efficiency.
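
To make the alignment step concrete, the sketch below applies a 3D transformation to points so that an instance in an arbitrary coordinate frame lands in a shared canonical frame. It assumes the transformation is a similarity transform (uniform scale, rotation, translation) stored as a 4x4 matrix. The abstract only says "learnable 3D transformation", so this parameterization, the fixed example values, and the helper names are assumptions, and no learning is shown.

```python
import numpy as np

def similarity_matrix(scale, rotation, translation):
    """Build a 4x4 homogeneous matrix for x -> scale * R @ x + t (assumed form)."""
    W = np.eye(4)
    W[:3, :3] = scale * rotation
    W[:3, 3] = translation
    return W

def transform_points(W, points):
    """Apply a 4x4 homogeneous transform to an (N, 3) array of points."""
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    return (homo @ W.T)[:, :3]

def rot_z(theta):
    """Rotation about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical per-instance parameters; in training these would be optimized.
W = similarity_matrix(2.0, rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0]))

instance_pts = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
canonical_pts = transform_points(W, instance_pts)               # instance -> template frame
recovered_pts = transform_points(np.linalg.inv(W), canonical_pts)  # template -> instance frame
```

Once every instance carries its own transformation of this kind, differences in SfM-assigned orientation and scale are absorbed by that transformation, and the geometry network only has to model deformation relative to the shared template.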

Paper Structure (23 sections, 16 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: (a) Projection-based feature retrieval correctly locates features from nearby input views but often samples incorrect features from distant views due to occlusions. (b) In real-world datasets, camera poses are individually registered for each instance using SfM methods, resulting in arbitrary coordinate orientations (e.g. the blue or green axes) and scales. This leads to coordinate misalignment across different instances.
  • Figure 2: Overall pipeline of our VRF framework. The feature extractor $f_e$ is pre-trained using our contrastive learning-based strategy. $W$ represents the learned transformation that aligns each instance with the template space, while $\pi(\cdot)$ refers to the back-projection operation from the image plane to the camera space. $\hat{E}$ and $\hat{K}$ denote the arbitrary camera pose and intrinsic parameters, respectively.
  • Figure 3: Illustration of the contrastive learning-based pre-training strategy. The feature extractor $f_e$ is trained to generate similar representations for images from the same object instance and dissimilar representations for images from different instances.
  • Figure 4: Visual comparison of the similarity of $z'$ across different image pairs. After pre-training, the feature extractor $f_e$ generates similar representations for different input images of the same instance, while producing dissimilar representations for images of different instances.
  • Figure 5: Illustration of instance alignment. We render three different hydrant instances using the same camera pose and intrinsic parameters. All instances are correctly aligned with the predefined template. The template depth is rendered using Eq. \ref{eq.depth}.
  • ...and 5 more figures
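
The pre-training objective described in the captions for Figures 3 and 4 (similar features for images of the same instance, dissimilar features across instances) can be sketched with a standard InfoNCE-style loss. The captions do not state the exact loss the paper uses, so InfoNCE, the temperature value, and the function name are assumptions swapped in for illustration.

```python
import numpy as np

def info_nce(features, instance_ids, temperature=0.1):
    """InfoNCE-style loss: images sharing an instance id are treated as positives."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature
    n = sim.shape[0]
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)          # never compare an image with itself

    # Row-wise log-softmax over all other images in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))

    # Positives: same instance id, excluding the anchor itself.
    pos = (instance_ids[:, None] == instance_ids[None, :]) & ~self_mask
    pos_counts = pos.sum(axis=1)
    valid = pos_counts > 0                            # skip anchors with no positive
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)[valid] / pos_counts[valid]
    return per_anchor.mean()

# Toy features: two views each of two instances, already well separated.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss_matched = info_nce(feats, np.array([0, 0, 1, 1]))     # ids agree with the clusters
loss_mismatched = info_nce(feats, np.array([0, 1, 0, 1]))  # ids contradict the clusters
```

The loss is small when features of the same instance already coincide and large when the labels contradict the feature geometry, which is the behavior a feature extractor $f_e$ would be trained toward under this kind of objective.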