Scaling View Synthesis Transformers

Evan Kim; Hyunwoo Ryu; Thomas W. Mitchel; Vincent Sitzmann

TL;DR

Across several compute levels, our encoder-decoder architecture, the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.

Abstract

Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.
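
The performance-compute Pareto frontier mentioned above can be illustrated with a minimal sketch. It assumes each training run is summarized by a (training compute, test PSNR) pair and keeps only runs that no cheaper-or-equal run matches or beats. The function name `pareto_frontier` and all numbers are hypothetical placeholders, not results from the paper.

```python
from typing import List, Tuple

Run = Tuple[float, float]  # (training_compute, psnr); units are arbitrary here


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep runs whose PSNR exceeds that of every run with less or equal compute."""
    frontier: List[Run] = []
    best_psnr = float("-inf")
    # Sort by compute ascending; for equal compute, visit the best PSNR first.
    for compute, psnr in sorted(runs, key=lambda r: (r[0], -r[1])):
        if psnr > best_psnr:
            frontier.append((compute, psnr))
            best_psnr = psnr
    return frontier


# Made-up illustrative values (hypothetical, not measurements from the paper).
runs = [(1.0, 20.0), (2.0, 19.0), (2.0, 22.0), (4.0, 21.5), (8.0, 25.0)]
frontier = pareto_frontier(runs)
print(frontier)
```

Comparing two architectures at matched compute then amounts to comparing their frontiers rather than individual runs, which is the kind of equal-budget comparison the abstract argues for.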

Paper Structure (34 sections, 13 equations, 14 figures, 10 tables)

This paper contains 34 sections, 13 equations, 14 figures, 10 tables.

Introduction
Key Contributions:
Related Work and Preliminaries
Generalizable novel view synthesis.
Scaling Laws.
Extremely long context view synthesis.
Encoder-Decoder View Synthesis
The Effective Batch Size for View Synthesis
Analysis Setup.
Effective Batch Size is What Matters.
SVSM Enables Compute-Optimal Tradeoff.
Scaling Laws for Stereo ($V_C = 2$) NVS
Scaling Laws.
Optimal Model Choice.
SVSM-420M/740M Results.
...and 19 more sections

Figures (14)

Figure 1: Scaling Laws for View Synthesis Transformers. Evaluated on RealEstate10K (Zhou et al., 2018), our SVSM exhibits a $3\times$ more compute-optimal Pareto frontier than LVSM while retaining the same scaling behavior (similar slope and curvature everywhere).
Figure 2: Architectures of the current SOTA, the decoder-only LVSM (Jin et al., 2024) (a), and SVSM (ours, b). Our cross-attention-based decoder enables parallel rendering of multiple target views after a single scene encoding. Each target view is decoded independently given the shared scene representation, but the cross-attention allows these independent decodings to be executed in parallel.
Figure 3: Effective Batch Size. Training loss (smoothed with a rolling average) and test PSNR measured throughout training across various paired $B$ and $V_T$ runs provide evidence for our effective batch size hypothesis: models trained with the same product of the number of scenes in the batch $B$ and the number of reconstruction target views $V_T$, i.e. runs with the same effective batch size $B_{\text{eff}} = B \cdot V_T$, perform the same and are colored identically. On $V_C = 8$ (top), we sweep across $B_{\text{eff}} = 128, 256$ on DL3DV, and on $V_C = 2$ (bottom), we sweep across $B_{\text{eff}} = 128, 1024$ on RealEstate10K.
Figure 4: Data and Model Scaling Plots. While our model (blue) is optimal when sufficient data is available, decoder-only LVSM (red) performs better with less data. The Pareto frontier analysis shows that our model is more data-hungry. Our model is also less parameter-efficient, although the gap closes as we increase the training compute. However, with sufficient data and compute, our model (blue) is overall superior in terms of training compute-optimality and rendering speed.
Figure 5: Qualitative Scaling Behavior, $V_C = 2$. From left to right, both models steadily increase in rendering quality until reaching near photo-realistic results. Compared vertically, for a given compute budget, SVSM renderings consistently contain fewer artifacts.
...and 9 more figures
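
The effective batch size described in the Figure 3 caption is the product of the number of scenes per batch $B$ and the number of target views $V_T$. As a rough sketch, the snippet below enumerates every integer $(B, V_T)$ pair that shares a given effective batch size; under the hypothesis, such configurations would be expected to train equivalently. The function names are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple


def effective_batch_size(num_scenes: int, num_target_views: int) -> int:
    """B_eff = B * V_T, as defined in the Figure 3 caption."""
    return num_scenes * num_target_views


def configs_with_effective_batch_size(b_eff: int) -> List[Tuple[int, int]]:
    """List all integer (B, V_T) pairs whose product equals b_eff."""
    return [(b, b_eff // b) for b in range(1, b_eff + 1) if b_eff % b == 0]


# 128 is one of the effective batch sizes named in the caption.
pairs = configs_with_effective_batch_size(128)
print(pairs)
```

For example, `(16, 8)` and `(64, 2)` both appear in `pairs`: the first renders more target views per scene, while the second spreads the same number of supervised views over more scenes.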
