
MuRF: Multi-Baseline Radiance Fields

Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas Geiger, Fisher Yu

TL;DR

MuRF addresses sparse-view novel view synthesis across both small and large camera baselines by introducing a target-view frustum volume, which is aligned with the target image to effectively aggregate information from multiple input views. A multi-view feature encoder generates robust representations, while a (2+1)D CNN-based radiance field decoder regresses a full radiance field from a low-resolution volume, aided by hierarchical volume sampling for efficiency. The approach achieves state-of-the-art results on diverse datasets (e.g., DTU, RealEstate10K, LLFF) and exhibits promising zero-shot generalization on Mip-NeRF 360, demonstrating strong generalization across baselines without per-scene optimization. Overall, MuRF provides a geometry-aware, feed-forward solution that preserves sharp scene structures and scales to high-resolution rendering, with broad applicability to object-centric and scene-scale scenarios.

Abstract

We present Multi-Baseline Radiance Fields (MuRF), a general feed-forward approach to solving sparse view synthesis under multiple different baseline settings (small and large baselines, and different numbers of input views). To render a target novel view, we discretize the 3D space into planes parallel to the target image plane, and accordingly construct a target view frustum volume. Such a target volume representation is spatially aligned with the target view, which effectively aggregates relevant information from the input views for high-quality rendering. It also facilitates subsequent radiance field regression with a convolutional network thanks to its axis-aligned nature. The 3D context modeled by the convolutional network enables our method to synthesize sharper scene structures than prior works. Our MuRF achieves state-of-the-art performance across multiple different baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K and LLFF). We also show promising zero-shot generalization abilities on the Mip-NeRF 360 dataset, demonstrating the general applicability of MuRF.
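
As an illustration of the plane-based discretization described in the abstract, the following is a minimal NumPy sketch of how sample points for a target-view frustum volume could be generated. It is not the authors' implementation. The function name `target_frustum_points`, the pinhole intrinsics `K`, the `cam_to_world` pose convention, the `near`/`far` bounds, and the uniform depth spacing are all assumptions made for this example. Only the 8x spatial subsampling and the D samples per ray come from the overview figure.

```python
import numpy as np

def target_frustum_points(K, cam_to_world, H, W, D, near, far, stride=8):
    """Hypothetical helper: world-space sample points of a target-view frustum.

    Returns an array of shape (H // stride, W // stride, D, 3). Each depth
    index corresponds to one plane parallel to the target image plane.
    """
    h, w = H // stride, W // stride
    # Centres of the subsampled pixels, in full-resolution pixel coordinates.
    us = (np.arange(w) + 0.5) * stride
    vs = (np.arange(h) + 0.5) * stride
    u, v = np.meshgrid(us, vs)                           # each (h, w)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)     # homogeneous pixels
    # Back-project through the pinhole model; directions have z = 1.
    dirs = pix @ np.linalg.inv(K).T                      # (h, w, 3)
    # Assumption: depths are spaced uniformly between near and far.
    depths = np.linspace(near, far, D)                   # (D,)
    pts_cam = dirs[:, :, None, :] * depths[None, None, :, None]  # (h, w, D, 3)
    # Rigid transform from the target camera frame to the world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

# Toy usage: a 64x64 target view with 4 depth planes and an identity pose.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
points = target_frustum_points(K, np.eye(4), H=64, W=64, D=4, near=1.0, far=4.0)
```

Because every point with the same depth index shares the same camera-space z value, the resulting grid is axis-aligned with the target view. In a full pipeline these points would then be projected into each input view to sample features and colors.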

Paper Structure (15 sections, 2 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: MuRF supports multiple different baseline settings. Previous methods are specifically designed for either small (e.g., ENeRF \cite{lin2022efficient}) or large (e.g., AttnRend \cite{du2023learning}) baselines. However, no existing method performs well on both (see Table \ref{tab:small_large}).
  • Figure 2: Overview. Given multiple input images, we first extract multi-view image features with a multi-view Transformer. To render a target image of resolution $H \times W$, we construct a target view frustum volume by performing $8\times$ subsampling in the spatial dimension while casting rays and sampling $D$ equidistant points on each ray. For each 3D point, we sample feature and color information from the extracted feature maps and input images; together these form the elements of the target volume ${\bm z} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times D \times C}$. Here, $C$ denotes the channel dimension after aggregating sampled features and colors. To reconstruct the radiance field from the volume, we model the context information in the decoder with a (2+1)D CNN operating at low resolution and subsequently obtain the full-resolution radiance field with a lightweight $8 \times$ upsampler. The target image is finally rendered with volumetric rendering.
  • Figure 3: Visual comparisons with previous best methods on DTU, RealEstate10K and LLFF datasets.
  • Figure 4: Zero-shot generalization on Mip-NeRF 360 dataset.
  • Figure 5: Our target view volume vs. reference (first image) view volume. Constructing the volume in the pre-defined reference view space might miss relevant information in other views (e.g., the regions marked by red arrows), since such information could be far away from the reference view's epipolar lines and thus hard to sample. In contrast, we construct the volume in the target view, which more effectively aggregates information from all input views and thus maximizes the information usage.
  • ...and 6 more figures
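
The Figure 2 caption ends with the target image being rendered by volumetric rendering. As a hedged illustration, the sketch below implements the standard emission-absorption compositing used in NeRF-style methods. Whether MuRF uses exactly this quadrature is an assumption. The function name `composite_rays` and the large width given to the last sample interval are choices made for this example only.

```python
import numpy as np

def composite_rays(sigma, rgb, depths):
    """Hypothetical helper: alpha-composite per-sample densities and colors.

    sigma:  (..., D) non-negative densities along each ray
    rgb:    (..., D, 3) per-sample colors
    depths: (D,) increasing sample depths shared by all rays
    Returns colors of shape (..., 3).
    """
    # Interval lengths between samples; the last interval is treated as
    # effectively unbounded (an assumption of this sketch).
    deltas = np.diff(depths, append=depths[-1] + 1e10)
    # Opacity contributed by each interval.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = np.cumprod(1.0 - alpha, axis=-1)
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=-1)
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(axis=-2)

# Toy usage: one ray with three samples; the first sample is fully opaque.
depths = np.array([1.0, 2.0, 3.0])
rgb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
opaque_front = composite_rays(np.array([1e3, 0.0, 0.0]), rgb, depths)
empty_ray = composite_rays(np.zeros(3), rgb, depths)
```

In this toy case the opaque first sample determines the rendered color, and a ray with zero density everywhere renders as black, since no background color is modeled here.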