Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention

Xiaosong Jia, Yihang Sun, Junqi You, Songbur Wong, Zichen Zou, Junchi Yan, Zuxuan Wu, Yu-Gang Jiang

TL;DR

Efficient-LVSM tackles the inefficiencies of monolithic transformer-based novel view synthesis by decoupling input-view encoding from target-view generation. Its dual-stream architecture pairs an Input Encoder that uses intra-view self-attention with a Target Decoder that uses self-attention plus cross-attention to the encoder outputs. This design scales near-linearly in the number of input views and enables KV-cache-based incremental inference, which reduces latency. Intra-target attention, encoder–decoder co-refinement, and REPA distillation further improve fidelity. On RealEstate10K and Objaverse, Efficient-LVSM delivers state-of-the-art quality with substantially faster training and inference, plus strong zero-shot generalization to unseen input-view counts, making it a practical advance for scalable, geometry-free 3D view synthesis.
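The scaling claim can be made concrete with a rough count of query-key pairs per attention layer. The sketch below is only an illustration under stated assumptions: a single target view, the same number of tokens per view (`tokens_per_view`) for inputs and target, and hypothetical function names that are not taken from the paper. It ignores MLP cost and constant factors.

```python
def full_attention_pairs(n_in: int, tokens_per_view: int) -> int:
    """Query-key pairs when all input tokens and one target view share one sequence."""
    seq_len = (n_in + 1) * tokens_per_view
    return seq_len * seq_len


def decoupled_attention_pairs(n_in: int, tokens_per_view: int) -> int:
    """Query-key pairs for the decoupled layout (assumed: one target view)."""
    p = tokens_per_view
    encoder = n_in * p * p          # each input view attends only to itself
    target_self = p * p             # the target view attends to itself
    target_cross = p * (n_in * p)   # target queries attend to all input tokens
    return encoder + target_self + target_cross
```

Under these assumptions the decoupled count is `(2 * n_in + 1) * tokens_per_view ** 2`, which grows linearly in `n_in`, while the joint-sequence count grows with `(n_in + 1) ** 2`.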

Abstract

Feedforward models for novel view synthesis (NVS) have recently been advanced by transformer-based methods such as LVSM, which apply attention among all input and target views. In this work, we argue that this full self-attention design is suboptimal: it suffers from quadratic complexity with respect to the number of input views and from rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism. It applies intra-view self-attention to input views and self-then-cross attention to target views, eliminating unnecessary computation. Efficient-LVSM achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference. Thanks to its decoupled design, Efficient-LVSM achieves state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and enables incremental inference with a KV-cache.
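As a rough illustration of the attention pattern the abstract describes, the following NumPy sketch runs one layer of intra-view self-attention on the input views and self-then-cross attention on a target view. It is a simplified sketch, not the authors' implementation: it uses a single head, omits learned projections, normalization, and MLPs, and every function name here is hypothetical. It also does not model co-refinement across multiple layers.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention; shapes (..., tokens, dim)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def encode_inputs(input_tokens):
    """Intra-view self-attention with a residual connection.

    input_tokens: (views, tokens, dim). The leading axis is a batch axis,
    so each view attends only to its own tokens.
    """
    return input_tokens + attention(input_tokens, input_tokens, input_tokens)


def decode_target(target_tokens, context):
    """Self-then-cross attention for one target view.

    target_tokens: (tokens, dim); context: (views, tokens, dim).
    """
    x = target_tokens + attention(target_tokens, target_tokens, target_tokens)
    kv = context.reshape(-1, context.shape[-1])  # flatten all input views
    return x + attention(x, kv, kv)


def append_views(cache, new_input_tokens):
    """Extend cached encoder outputs with newly encoded views (assumed cache layout)."""
    return np.concatenate([cache, encode_inputs(new_input_tokens)], axis=0)
```

Because each input view is encoded independently in this sketch, encoding two views and later appending a third with `append_views` gives the same cached context as encoding all three at once. That property is what makes incremental inference with cached encoder outputs possible.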

Paper Structure

This paper contains 25 sections, 7 equations, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: Latent Novel View Synthesis Paradigms Comparison. The proposed decoupled architecture disentangles the input and target streams with lower $O(N_{in})$ complexity and no duplication of tokens.
  • Figure 2: Efficient-LVSM Model Structure. Efficient-LVSM patchifies posed input images and target Plücker rays into tokens. Input tokens pass separately through an encoder to extract context, while target tokens cross-attend to generate new views. Asterisks indicate shared parameters.
  • Figure 3: Vanilla Encoder-Decoder vs. Dual-Stream Co-refinement. (a) A vanilla encoder-decoder wastes the hidden features of its middle layers, while the dual-stream co-refinement structure uses these features to extract more information. (b) Feature maps indicate that the co-refinement structure captures more details of the target view.
  • Figure 4: Applying REPA to Efficient-LVSM. (a) Pretrained vision encoders and MLP projectors are discarded at inference. (b) Feature maps indicate that REPA helps the model extract semantics.
  • Figure 5: NVS Visual Comparison. We compare with LVSM (Jin et al., 2025) on RealEstate10K (Zhou et al., 2018) and Amazon Berkeley Objects (Collins et al., 2022). Images rendered by our model show fewer blurry details.
  • ...and 5 more figures