Scaling View Synthesis Transformers

Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann

TL;DR

Across several compute levels, the encoder-decoder Scalable View Synthesis Model (SVSM) scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.

Abstract

Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.
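
To make the term "performance-compute Pareto frontier" concrete, here is a minimal sketch of how such a frontier could be extracted from a set of training runs. The function name `pareto_frontier` and the (compute, PSNR) values are hypothetical placeholders for illustration only; they are not results or code from the paper.

```python
def pareto_frontier(runs):
    """Return the runs that no other run beats on both axes.

    `runs` is a list of (training_compute, psnr) pairs. A run is kept if
    no other run reaches an equal or higher PSNR at equal or lower compute.
    """
    frontier = []
    best_psnr = float("-inf")
    # Sort by increasing compute; for equal compute, best PSNR first.
    for compute, psnr in sorted(runs, key=lambda r: (r[0], -r[1])):
        # Keep a run only if it improves on every cheaper run seen so far.
        if psnr > best_psnr:
            frontier.append((compute, psnr))
            best_psnr = psnr
    return frontier


# Hypothetical placeholder values (arbitrary compute units, PSNR in dB).
toy_runs = [(1.0, 20.0), (2.0, 19.0), (2.0, 22.0), (4.0, 21.5), (8.0, 24.0)]
frontier = pareto_frontier(toy_runs)
print(frontier)
```

Under this definition, one model family has a superior frontier when, at a given compute budget, its frontier sits at a higher PSNR than the other's.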

Paper Structure (34 sections, 13 equations, 14 figures, 10 tables)

Figures (14)

  • Figure 1: Scaling Laws for View Synthesis Transformers. Evaluated on RealEstate10K (Zhou et al., 2018), our SVSM exhibits a $3\times$ more compute-optimal Pareto frontier than LVSM while retaining the same scaling behavior (similar slope and curvature everywhere).
  • Figure 2: Architectures of the current SOTA, the decoder-only LVSM (Jin et al., 2024) (a), and SVSM (ours, b). Our cross-attention based decoder enables parallel rendering of multiple target views after a single scene encoding. Each target view is decoded independently given the shared scene representation, but the cross-attention allows these independent decodings to be executed in parallel.
  • Figure 3: Effective Batch Size. Training loss (smoothed with a rolling average) and test PSNR, measured throughout training across various paired $B$ and $V_T$ runs, provide evidence for our effective batch size hypothesis: models trained with the same product of the number of scenes in the batch $B$ and the number of reconstruction target views $V_T$, i.e. runs with the same effective batch size $B_{\text{eff}}$, perform the same and are colored identically. On $V_C = 8$ (top), we sweep across $B_{\text{eff}} = 128, 256$ on DL3DV, and on $V_C = 2$ (bottom), we sweep across $B_{\text{eff}} = 128, 1024$ on RealEstate10K.
  • Figure 4: Data and Model Scaling Plots. While our model (blue) is optimal when sufficient data is available, decoder-only LVSM (red) performs better with less data. The Pareto frontier analysis shows that our model is more data-hungry. Our model is also less parameter-efficient, although the gap closes as we increase the training compute. However, with sufficient data and compute, our model (blue) is overall superior in terms of training compute-optimality and rendering speed.
  • Figure 5: Qualitative Scaling Behavior, $V_C = 2$. From left to right, both models steadily increase in rendering quality until reaching near photo-realistic results. Compared vertically, for a given compute budget, SVSM renderings consistently contain fewer artifacts.
  • ...and 9 more figures
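
The effective batch size described in the Figure 3 caption is the product of the number of scenes per batch $B$ and the number of target views $V_T$, i.e. $B_{\text{eff}} = B \cdot V_T$. The sketch below groups training configurations by that product, so that runs expected to behave alike under the hypothesis end up together. The helper names and the specific $(B, V_T)$ pairs are illustrative assumptions, not the configurations used in the paper.

```python
from collections import defaultdict


def effective_batch_size(num_scenes: int, num_target_views: int) -> int:
    """B_eff = B * V_T, following the definition in the Figure 3 caption."""
    return num_scenes * num_target_views


def group_by_effective_batch(configs):
    """Group (B, V_T) pairs that share the same effective batch size."""
    groups = defaultdict(list)
    for num_scenes, num_target_views in configs:
        b_eff = effective_batch_size(num_scenes, num_target_views)
        groups[b_eff].append((num_scenes, num_target_views))
    return dict(groups)


# Hypothetical (B, V_T) pairs chosen only to illustrate the grouping.
configs = [(128, 1), (64, 2), (32, 4), (256, 1), (64, 4)]
groups = group_by_effective_batch(configs)
for b_eff in sorted(groups):
    print(b_eff, groups[b_eff])
```

Under the hypothesis, every pair inside one group would be expected to reach the same training loss and test PSNR, which is what the identical coloring in the figure conveys.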