sshELF: Single-Shot Hierarchical Extrapolation of Latent Features for 3D Reconstruction from Sparse-Views
Eyvaz Najafli, Marius Kästingschäfer, Sebastian Bernhard, Thomas Brox, Andreas Geiger
TL;DR
sshELF addresses the challenge of reconstructing unbounded outdoor scenes from sparse outward-facing views with a fast two-stage pipeline that first generates multiple intermediate virtual views and then decodes them into 3D Gaussian primitives. By disentangling latent feature extrapolation from primitive decoding and leveraging cross-scene priors along with pretrained foundation models, the method reconstructs occluded regions accurately and supports real-time novel-view rendering from six input views. Key contributions include the hierarchical ELF blocks for virtual-view extrapolation, the UNet-like translator that maps latent features to Gaussian splats, and the integration of a DINOv2 latent encoder with depth priors for robust cross-scene generalization. Experiments on SEED4D and nuScenes demonstrate competitive performance, real-time speed, and faithful reconstruction of occluded regions, highlighting practical relevance for surround-view and autonomous driving applications.
Abstract
Reconstructing unbounded outdoor scenes from sparse outward-facing views poses significant challenges due to minimal view overlap. Previous methods often lack cross-scene understanding, and their primitive-centric formulations overload local features to compensate for missing global context, resulting in blurriness in unseen parts of the scene. We propose sshELF, a fast, single-shot pipeline for sparse-view 3D scene reconstruction via hierarchical extrapolation of latent features. Our key insight is that disentangling information extrapolation from primitive decoding allows efficient transfer of structural patterns across training scenes. Our method: (1) learns cross-scene priors to generate intermediate virtual views that extrapolate to unobserved regions, (2) offers a two-stage network design separating virtual view generation from 3D primitive decoding for efficient training and modular model design, and (3) integrates a pre-trained foundation model for joint inference of latent features and texture, improving scene understanding and generalization. sshELF can reconstruct 360-degree scenes from six sparse input views and achieves competitive results on synthetic and real-world datasets. We find that sshELF faithfully reconstructs occluded regions, supports real-time rendering, and provides rich latent features for downstream applications. The code will be released.
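The two-stage separation described above can be illustrated with a shape-level sketch. This is not the authors' implementation: the names `LatentViews`, `extrapolate_virtual_views`, and `count_decoded_gaussians` are hypothetical, as are the number of virtual views, the latent resolution, the channel width, and the assumption that each latent pixel of a virtual view decodes into a fixed number of Gaussians. Only the count of six input views comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentViews:
    """Shape descriptor for a stack of per-view latent feature maps."""
    num_views: int
    height: int
    width: int
    channels: int


def extrapolate_virtual_views(inputs: LatentViews, num_virtual: int) -> LatentViews:
    """Stage 1 stand-in: input-view latents -> virtual-view latents.

    A real model would run learned blocks here; this sketch only tracks
    how the tensor shape would change (assumed, not from the paper).
    """
    return LatentViews(num_virtual, inputs.height, inputs.width, inputs.channels)


def count_decoded_gaussians(virtual: LatentViews, per_pixel: int = 1) -> int:
    """Stage 2 stand-in: assume each latent pixel yields `per_pixel` Gaussians."""
    return virtual.num_views * virtual.height * virtual.width * per_pixel


# Six input views (from the text); every other number is a placeholder.
inputs = LatentViews(num_views=6, height=16, width=16, channels=384)
virtual = extrapolate_virtual_views(inputs, num_virtual=12)
num_gaussians = count_decoded_gaussians(virtual)
print(virtual, num_gaussians)
```

Because the two stages only communicate through the virtual-view latents, either stand-in could be swapped for a different module without touching the other, which is the modularity the abstract points to.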
