LaRa: Efficient Large-Baseline Radiance Fields

Anpei Chen, Haofei Xu, Stefano Esposito, Siyu Tang, Andreas Geiger

TL;DR

LaRa addresses the challenge of reconstructing high-fidelity 360° radiance fields from sparse, large-baseline views without per-scene optimization. It introduces a Gaussian volume representation and a Volume Transformer with Group Attention that implicitly matches features across local and global contexts, combined with coarse-to-fine decoding and differentiable splatting for efficient high-resolution rendering. The approach demonstrates strong zero-shot and cross-domain generalization, achieving superior geometry and texture quality on multiple datasets with a modest training budget (e.g., 2 days on 4 A100 GPUs for the fast model). This work advances large-baseline radiance-field reconstruction with practical efficiency and broad applicability to mesh extraction and view synthesis from sparse inputs.

Abstract

Radiance field methods have achieved photorealistic novel view synthesis and geometry reconstruction, but they are mostly applied in per-scene optimization or small-baseline settings. While several recent works investigate feed-forward reconstruction with large baselines by utilizing transformers, they all operate with a standard global attention mechanism and hence ignore the local nature of 3D reconstruction. We propose a method that unifies local and global reasoning in transformer layers, resulting in improved quality and faster convergence. Our model represents scenes as Gaussian Volumes and combines this with an image encoder and Group Attention Layers for efficient feed-forward reconstruction. Experimental results show that our model, trained for two days on four GPUs, reconstructs 360° radiance fields with high fidelity and is robust in zero-shot and out-of-domain testing. Project page: https://apchenstu.github.io/LaRa/.

Paper Structure (15 sections, 7 equations, 10 figures, 3 tables)


Figures (10)

  • Figure 1: LaRa is a feed-forward 2D Gaussian Splatting model that reconstructs radiance fields from large-baseline views, a single image, or a text prompt.
  • Figure 2: Pipeline. Our method represents objects as dense voxels filled with 2D Gaussian primitives. We first construct 3D feature volumes $\mathbf{V}_\text{f}$ by lifting 2D DINO features to a canonical volume, modulated by Plücker rays (Section \ref{sec:representation}). We then apply a volume transformer to reconstruct a Gaussian volume $\mathbf{V}_\mathcal{G}$ from the feature and embedding volumes (Section \ref{sec:feedforward}). We use a coarse-to-fine decoding process to regress 2D Gaussian primitive parameters (Section \ref{sec:renderer}), followed by rasterization for efficient rendering.
  • Figure 3: Volume Transformer. We aggregate the embedding volume $\mathbf{V}_\text{e}$ and feature volume $\mathbf{V}_\text{f}$ through a series of Group Attention Layers that progressively match features. In each layer, the volumes are first unfolded into local groups. Subsequently, a layer normalization is applied, followed by a GroupCrossAttn sublayer. This is followed by another normalization and an MLP layer. The output is reshaped back to the original embedding volume shape, processed by a 3D convolution layer, and forwarded to the next layer. To connect the output of the sublayers, we use residual connections.
  • Figure 4: Coarse-fine decoding. Top row: A "coarse" decoding module transforms the voxel features $\mathbf{V}^i_{\mathcal{G}}$ into $K$ 2D Gaussian parameters, representing shape (specifically, $\alpha, \mathbf{t}, \mathbf{S}, \Delta$) and appearance (denoted as $\text{SH}^{\textit{coarse}}$). This step is followed by a splatting procedure. On the bottom, a "fine" decoding module aggregates rendering buffers (i.e., RGB, depth, and alpha maps) from the coarse module, volume feature, and source images for appearance enhancement. It projects the centers of primitives onto these buffers, applies cross-attention with the voxel features $\mathbf{V}^i_{\mathcal{G}}$, and produces residual spherical harmonics $\text{SH}^{\textit{residuals}}$. These residuals are added to the coarse spherical harmonics for a refined splatting process.
  • Figure 5: Rendering results of unseen scenes. The top two rows compare our reconstructions with MVSNeRF [Chen2021ICCVb] and LGM [tang2024lgm] on Co3D [reizenstein21co3d]. We also show the view synthesis results for the Gobjaverse [gobjaverse], GSO [GSO], and generative multi-view [li2024instantd] datasets, arranged from top to bottom. Note that visual results from MuRF are not shown due to their lack of content, appearing as white images. The above results are reconstructed using 4 input views.
  • ...and 5 more figures
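
The Figure 2 caption mentions that lifted image features are modulated by Plücker rays. As a rough illustration of what such a ray encoding looks like, the sketch below computes 6D Plücker coordinates for a batch of camera rays with NumPy. The function name `plucker_rays` and the (direction, moment) ordering are assumptions made here for illustration; the paper's own convention and the way the encoding modulates the features are not specified in this summary.

```python
import numpy as np

def plucker_rays(origins, directions):
    """Return 6D Pluecker coordinates (d, o x d) for rays of shape (..., 3).

    Hypothetical helper: the (direction, moment) ordering is an assumed
    convention, not taken from the paper.
    """
    # Normalize directions so the encoding does not depend on their length.
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    # The moment o x d is the same for every point o on the ray.
    m = np.cross(origins, d)
    return np.concatenate([d, m], axis=-1)

# Example: a ray through (0, 0, 1) pointing along +x.
rays = plucker_rays(np.array([[0.0, 0.0, 1.0]]), np.array([[2.0, 0.0, 0.0]]))
```

A useful property of this encoding is that it identifies the ray itself rather than a particular origin on it: sliding the origin along the direction leaves all six numbers unchanged.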
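
The Figure 3 caption describes unfolding the embedding and feature volumes into local groups before cross-attention. The following is a minimal NumPy sketch of that idea only, under several assumptions: non-overlapping cubic groups of side `g`, a single attention head, and no learned projections. It omits the layer normalization, MLP, and 3D convolution named in the caption, and the helper names (`unfold_groups`, `fold_groups`, `group_cross_attention`) are hypothetical rather than the authors' implementation.

```python
import numpy as np

def unfold_groups(vol, g):
    """(D, D, D, C) -> (G, g**3, C) with G = (D // g)**3 non-overlapping groups."""
    D, _, _, C = vol.shape
    n = D // g
    v = vol.reshape(n, g, n, g, n, g, C).transpose(0, 2, 4, 1, 3, 5, 6)
    return v.reshape(n**3, g**3, C)

def fold_groups(groups, D, g):
    """Inverse of unfold_groups: (G, g**3, C) -> (D, D, D, C)."""
    n, C = D // g, groups.shape[-1]
    v = groups.reshape(n, n, n, g, g, g, C).transpose(0, 3, 1, 4, 2, 5, 6)
    return v.reshape(D, D, D, C)

def group_cross_attention(v_e, v_f, g):
    """Embedding tokens attend only to feature tokens of the same local group.

    v_e: (D, D, D, C) embedding volume; v_f: (N, D, D, D, C) per-view features.
    Assumed simplification: queries, keys, and values are the raw tokens.
    """
    D = v_e.shape[0]
    q = unfold_groups(v_e, g)                                        # (G, g^3, C)
    kv = np.concatenate([unfold_groups(f, g) for f in v_f], axis=1)  # (G, N*g^3, C)
    scores = q @ kv.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)                     # stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return fold_groups(q + w @ kv, D, g)                             # residual connection

rng = np.random.default_rng(0)
v_e = rng.normal(size=(4, 4, 4, 8))
v_f = rng.normal(size=(2, 4, 4, 4, 8))
out = group_cross_attention(v_e, v_f, g=2)  # same shape as v_e
```

Because attention is restricted to each group, the cost grows with the number of groups times the squared group size instead of the squared total number of voxels. In this sketch information cannot cross group boundaries; in the captioned design, the 3D convolution after each layer is one place where neighboring groups could exchange information.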