SMERF: Streamable Memory Efficient Radiance Fields for Real-Time Large-Scene Exploration

Daniel Duckworth, Peter Hedman, Christian Reiser, Peter Zhizhin, Jean-François Thibert, Mario Lučić, Richard Szeliski, Jonathan T. Barron

TL;DR

SMERF tackles real-time, high-fidelity view synthesis for large-scale scenes under strict memory constraints by distilling a Zip-NeRF teacher into a hierarchical MERF-based student. It employs coordinate-space $K^3$ subvolumes, a deferred appearance network partitioned on a $P^3$ lattice, and feature gating to boost capacity while keeping per-frame costs low, enabling rendering in a web browser on commodity devices. A distillation training regime with appearance and geometry losses, data augmentation, and submodel-consistency regularization, combined with a distance-grid accelerated live viewer, yields PSNR gains of up to $0.78$ dB (and $1.78$ dB on large scenes) over prior real-time methods. The approach demonstrates that memory and compute can be kept effectively independent of scene size while achieving Zip-NeRF–like fidelity in real time, enabling practical large-scale, interactive exploration.

Abstract

Recent techniques for real-time view synthesis have rapidly advanced in fidelity and speed, and modern methods are capable of rendering near-photorealistic scenes at interactive frame rates. At the same time, a tension has arisen between explicit scene representations amenable to rasterization and neural fields built on ray marching, with state-of-the-art instances of the latter surpassing the former in quality while being prohibitively expensive for real-time applications. In this work, we introduce SMERF, a view synthesis approach that achieves state-of-the-art accuracy among real-time methods on large scenes with footprints up to 300 m$^2$ at a volumetric resolution of 3.5 mm$^3$. Our method is built upon two primary contributions: a hierarchical model partitioning scheme, which increases model capacity while constraining compute and memory consumption, and a distillation training strategy that simultaneously yields high fidelity and internal consistency. Our approach enables full six degrees of freedom (6DOF) navigation within a web browser and renders in real-time on commodity smartphones and laptops. Extensive experiments show that our method exceeds the current state-of-the-art in real-time novel view synthesis by 0.78 dB on standard benchmarks and 1.78 dB on large scenes, renders frames three orders of magnitude faster than state-of-the-art radiance field models, and achieves real-time performance across a wide variety of commodity devices, including smartphones. We encourage readers to explore these models interactively at our project website: https://smerf-3d.github.io.

Paper Structure (50 sections, 17 equations, 10 figures, 16 tables)

Figures (10)

  • Figure 1: Coordinate systems in SMERF for a scene with $\mathrm{K}^3 = 3^3$ coordinate space partitions and $\mathrm{P}^3 = 4^3$ deferred appearance network sub-partitions. Each partition is capable of representing the entire scene while allocating the majority of its model capacity to its corresponding partition. Within each partition, we instantiate a set of spatially-anchored MLP weights $\{ \theta_{i,j} \}$ parameterizing the deferred appearance model, which we trilinearly interpolate as a function of the camera origin $\mathbf{o}$ during rendering. In (a), we present the entire scene in world coordinates with the scene partition and highlight two submodels. In (b) and (c) we present the same scene from the view of two submodels in their corresponding contracted coordinate systems. (b) visualizes the rendering and parameter interpolation process when the camera origin $\mathbf{o}$ lies inside of a submodel's partition, and (c) visualizes the same when it lies outside.
  • Figure 2: Teacher Supervision. The student receives photometric supervision via rendered colors and geometric supervision via the rendering weights along camera rays. Both models operate on the same set of ray intervals.
  • Figure 3: Ray jittering. To generate training rays for our student model (in gray) we randomly perturb the origins and directions of the camera rays used to supervise our teacher model (in red).
  • Figure 4: Qualitative comparison. We show results from our model and from 3D Gaussian Splatting (Kerbl et al., 2023) alongside ground-truth images on scenes from the mip-NeRF 360 (Barron et al., 2022) (left) and Zip-NeRF (Barron et al., 2023) (right) datasets. 3D Gaussian Splatting struggles to reproduce the thin geometry, high-frequency textures, and view-dependent effects which our model successfully recovers.
  • Figure 5: Feature ablations. We incrementally add distillation (MERF+D), optimization (MERF+DO), and model contributions (Ours) from the mip-NeRF 360 ablations table to MERF to reach our submodel architecture. Distillation and optimization contributions markedly increase geometric and texture detail while model contributions improve view-dependent modeling accuracy.
  • ...and 5 more figures
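
The Figure 1 caption describes trilinearly interpolating spatially anchored MLP weights as a function of the camera origin $\mathbf{o}$. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. It assumes the origin has already been mapped into lattice units in $[0, P-1]^3$ within one partition, that each vertex stores its parameters as a flat list of floats, and that $P \ge 2$. The names `corner_weights`, `interpolate_params`, and `theta` are hypothetical.

```python
import math
from itertools import product


def corner_weights(origin, p):
    """Return {(i, j, k): weight} for the 8 lattice vertices around `origin`.

    Assumption: `origin` is a 3-tuple in lattice units; it is clamped to
    [0, p - 1] per axis, and p >= 2.
    """
    base, frac = [], []
    for x in origin:
        x = min(max(x, 0.0), p - 1.0)        # clamp to the lattice extent
        lo = min(int(math.floor(x)), p - 2)  # lower vertex; keeps lo + 1 valid
        base.append(lo)
        frac.append(x - lo)                  # fractional position in the cell
    weights = {}
    for offset in product((0, 1), repeat=3):
        w = 1.0
        for axis, bit in enumerate(offset):
            w *= frac[axis] if bit else 1.0 - frac[axis]
        vertex = tuple(b + o for b, o in zip(base, offset))
        weights[vertex] = w
    return weights


def interpolate_params(theta, origin, p):
    """Blend the per-vertex parameter lists `theta[(i, j, k)]` at `origin`."""
    weights = corner_weights(origin, p)
    size = len(next(iter(theta.values())))
    blended = [0.0] * size
    for vertex, w in weights.items():
        for n, value in enumerate(theta[vertex]):
            blended[n] += w * value
    return blended
```

As a usage example, with $P = 4$ one would build `theta` with $4^3 = 64$ entries and call `interpolate_params(theta, (1.5, 0.25, 3.0), 4)` once per frame, then evaluate the deferred appearance network with the blended parameters. Because the eight weights are non-negative and sum to one, the blended parameters vary continuously as the camera moves between lattice vertices.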