sshELF: Single-Shot Hierarchical Extrapolation of Latent Features for 3D Reconstruction from Sparse-Views

Eyvaz Najafli, Marius Kästingschäfer, Sebastian Bernhard, Thomas Brox, Andreas Geiger

TL;DR

sshELF addresses the challenge of reconstructing unbounded outdoor scenes from sparse outward-facing views by introducing a fast two-stage pipeline that first generates multiple intermediate virtual views and then decodes them into 3D Gaussian primitives. By disentangling latent feature extrapolation from primitive decoding and leveraging cross-scene priors along with pretrained foundation models, the method achieves accurate occlusion-filled reconstructions and real-time novel-view rendering from six views. Key contributions include the hierarchical ELF blocks for virtual-view extrapolation, the UNet-like translator mapping to Gaussian splats, and the integration of a DINOv2 latent encoder with depth priors for robust cross-scene generalization. Experiments on SEED4D and nuScenes demonstrate competitive performance, real-time speed, and faithful reconstruction of occluded regions, highlighting practical relevance for surround-view and autonomous driving applications.

Abstract

Reconstructing unbounded outdoor scenes from sparse outward-facing views poses significant challenges due to minimal view overlap. Previous methods often lack cross-scene understanding, and their primitive-centric formulations overload local features to compensate for missing global context, resulting in blurriness in unseen parts of the scene. We propose sshELF, a fast, single-shot pipeline for sparse-view 3D scene reconstruction via hierarchical extrapolation of latent features. Our key insight is that disentangling information extrapolation from primitive decoding allows efficient transfer of structural patterns across training scenes. Our method: (1) learns cross-scene priors to generate intermediate virtual views that extrapolate to unobserved regions, (2) offers a two-stage network design separating virtual view generation from 3D primitive decoding for efficient training and modular model design, and (3) integrates a pre-trained foundation model for joint inference of latent features and texture, improving scene understanding and generalization. sshELF can reconstruct 360-degree scenes from six sparse input views and achieves competitive results on synthetic and real-world datasets. We find that sshELF faithfully reconstructs occluded regions, supports real-time rendering, and provides rich latent features for downstream applications. The code will be released.

Paper Structure

This paper contains 15 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview. Given a number of input images, sshELF first reconstructs several virtual views and only then predicts the 3D Gaussian primitives of the scene from which novel views are rendered. The colors of the latent information correspond to different object classes, such as purple for buildings and green for vegetation.
  • Figure 2: Reference, Virtual and Novel Views. An example showing input views in green, a set of virtual views in red, and potential novel views in blue. Virtual view generation is key to enhancing representational capacity and extrapolating to unobserved scene areas.
  • Figure 3: Overview of sshELF. Given a few input images, sshELF first encodes them into latent features using a pre-trained DINOv2 (Sec. \ref{image_encoder}). As part of the backbone, the latent features, together with a pre-trained depth head, are used to initialize the virtual views, which are refined using hierarchical ELF blocks consisting of cross- and self-attention layers (Sec. \ref{backbone}). Reference and virtual views are then fed into the translator part to predict 3D Gaussian splats (Sec. \ref{translator}). Not shown here is the rasterization part used for creating novel views (Sec. \ref{rendering_nvs}).
  • Figure 4: Qualitative Novel View Synthesis Comparison on SEED4D Test Set. Comparison of large-baseline novel view synthesis under sparse observation conditions. Six ego-centric input frames (top row) with limited overlap serve as reference views. We evaluate each method's ability to reconstruct exo-centric views with a large offset from the input views.
  • Figure 5: Qualitative Novel View Synthesis Comparison on nuScenes Test Set. Visualization of multi-view synthesis results using six reference views captured at t=0. We compare novel views reconstructed at temporal differences of TD = 2, 3, and 4 (1 s, 1.5 s, and 2 s, respectively).
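
The Figure 3 caption describes virtual views being refined by ELF blocks built from cross- and self-attention layers. The toy sketch below illustrates only that data flow, not the authors' implementation. The function names (`attention`, `elf_block`, `extrapolate`), the token counts, the feature dimension, the number of virtual views, and the use of plain residual connections are all assumptions made for illustration. Learned projections, normalization, feed-forward layers, the depth-based initialization, and the hierarchical (multi-level) structure are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T * scale)  # (num_queries, num_keys)
    return weights @ values


def elf_block(virtual_tokens, reference_tokens):
    """Hypothetical ELF-style block: cross-attention, then self-attention."""
    # Virtual-view tokens query the reference-view tokens (cross-attention).
    x = virtual_tokens + attention(virtual_tokens, reference_tokens, reference_tokens)
    # Virtual-view tokens then exchange information among themselves.
    return x + attention(x, x, x)


def extrapolate(virtual_tokens, reference_tokens, num_blocks=3):
    """Apply a stack of blocks; num_blocks is an arbitrary illustrative value."""
    for _ in range(num_blocks):
        virtual_tokens = elf_block(virtual_tokens, reference_tokens)
    return virtual_tokens


# Toy sizes (assumed): 6 input views and 3 virtual views, 4 tokens each, dim 8.
rng = np.random.default_rng(0)
reference_tokens = rng.normal(size=(6 * 4, 8))
virtual_tokens = rng.normal(size=(3 * 4, 8))

refined = extrapolate(virtual_tokens, reference_tokens)
print(refined.shape)  # (12, 8)
```

In this sketch the reference tokens stay fixed while only the virtual-view tokens are updated, which mirrors the caption's description of virtual views being "refined". The refined tokens would then be passed, together with the reference views, to a separate translator stage that predicts the Gaussian primitives.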