Variable Radiance Field for Real-World Category-Specific Reconstruction from Single Image
Kun Wang, Zhiqiang Yan, Zhenyu Zhang, Xiang Li, Jun Li, Jian Yang
TL;DR
This work tackles single-image category-specific 3D reconstruction by moving away from projection-based local feature retrieval, which depends on known camera intrinsics. The proposed Variable Radiance Field (VRF) uses global latent encodings for geometry and appearance, a learnable category-specific shape template to align instances into a canonical space, and a compact, hyper-network-generated NeRF for fast, instance-specific rendering. Key contributions include the Object Encoding Module (OEM) for intrinsic-free feature extraction, the Dynamic Ray Sampling Module (DRSM) for canonical alignment, and the Instance Creation Module (ICM) for efficient NeRF generation, complemented by contrastive pretraining. Evaluations on the CO3D dataset show that VRF achieves state-of-the-art reconstruction quality and faster rendering than existing methods, with robustness to occlusions and viewpoint variations.
Abstract
Reconstructing category-specific objects using Neural Radiance Field (NeRF) from a single image is a promising yet challenging task. Existing approaches predominantly rely on projection-based feature retrieval to associate 3D points in the radiance field with local image features from the reference image. However, this process is computationally expensive, dependent on known camera intrinsics, and susceptible to occlusions. To address these limitations, we propose Variable Radiance Field (VRF), a novel framework that efficiently reconstructs category-specific objects without requiring known camera intrinsics and is robust to occlusions. First, we replace local feature retrieval with global latent representations, generated through a single feed-forward pass, which improves efficiency and eliminates reliance on camera intrinsics. Second, to tackle the coordinate inconsistencies inherent in real-world datasets, we define a canonical space by introducing a learnable, category-specific shape template and explicitly aligning each training object to this template using a learnable 3D transformation. This alignment also simplifies geometry prediction to modeling deformations from the template to individual instances. Finally, we employ a hyper-network-based method for efficient NeRF creation and enhance reconstruction performance through a contrastive learning-based pretraining strategy. Evaluations on the CO3D dataset demonstrate that VRF achieves state-of-the-art performance in both reconstruction quality and computational efficiency.
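To make the hyper-network idea concrete, the following minimal Python sketch shows how a global latent code could be mapped to the weights of a small instance-specific field that returns density and color for a 3D point. It is an illustration only: the class name HyperNetwork, the function field, the layer sizes, the single linear projection, and the random initialization are all assumptions chosen for brevity, not the architecture used in VRF.

```python
import math
import random


def linear(weights, bias, x):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class HyperNetwork:
    """Hypothetical hyper-network: latent code -> parameters of a tiny 2-layer MLP."""

    def __init__(self, latent_dim, in_dim=3, hidden=8, out_dim=4, seed=0):
        rng = random.Random(seed)
        # Parameter shapes of the generated field: W1, b1, W2, b2.
        self.shapes = [(hidden, in_dim), (hidden,), (out_dim, hidden), (out_dim,)]
        n_params = sum(math.prod(s) for s in self.shapes)
        scale = 1.0 / math.sqrt(latent_dim)
        # Assumption: one linear layer predicts all field parameters at once.
        self.proj = [[rng.gauss(0.0, scale) for _ in range(latent_dim)]
                     for _ in range(n_params)]
        self.proj_bias = [0.0] * n_params

    def generate(self, latent):
        """Predict a flat parameter vector and reshape it into W1, b1, W2, b2."""
        flat = linear(self.proj, self.proj_bias, latent)
        params, offset = [], 0
        for shape in self.shapes:
            size = math.prod(shape)
            chunk = flat[offset:offset + size]
            offset += size
            if len(shape) == 2:
                rows, cols = shape
                chunk = [chunk[r * cols:(r + 1) * cols] for r in range(rows)]
            params.append(chunk)
        return params


def field(params, point):
    """Evaluate the generated field at a 3D point; returns (density, rgb)."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, v) for v in linear(w1, b1, point)]  # ReLU
    out = linear(w2, b2, hidden)
    density = max(0.0, out[0])                             # non-negative density
    rgb = [1.0 / (1.0 + math.exp(-v)) for v in out[1:]]    # sigmoid to [0, 1]
    return density, rgb


# Usage: one forward pass of the hyper-network creates the instance field,
# which can then be queried at many points without touching the image again.
hyper = HyperNetwork(latent_dim=16)
latent = [0.1] * 16  # stand-in for an encoder's global latent code
instance_params = hyper.generate(latent)
density, rgb = field(instance_params, [0.1, 0.2, 0.3])
```

In this sketch the per-point query involves only the small generated MLP, which is the intuition behind the efficiency claim: the reference image is encoded once, and no per-point projection into the image (and hence no camera intrinsics) is needed at query time.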
