ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View
Wenbin Teng, Gonglin Chen, Haiwei Chen, Yajie Zhao
TL;DR
ARSS tackles the problem of generating temporally coherent novel views from a single image along a predefined camera trajectory. It introduces a decoder-only autoregressive framework that uses a video tokenizer to encode multi-view sequences, a camera autoencoder to inject 3D positional guidance via Plücker raymaps, and a permutation-based token ordering to maintain temporal causality while leveraging spatial context. The method achieves performance competitive with diffusion-based novel view synthesis (NVS) methods on RealEstate-10K, ACID, and DL3DV in zero-shot settings, and demonstrates strong long-horizon stability. This work highlights a path toward autoregressive, causally structured view synthesis and suggests future improvements in tokenizers and high-resolution training.
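To make the Plücker raymap mentioned above concrete, here is a minimal NumPy sketch of how such a map could be computed for one view. It assumes a pinhole camera with intrinsics `K` and a 4x4 camera-to-world matrix `c2w`, and it stores each ray as (direction, origin × direction). The function name, the channel order, and the pixel-center convention are illustrative assumptions, not details taken from ARSS.

```python
import numpy as np

def plucker_raymap(K, c2w, height, width):
    """Return a (height, width, 6) array of per-pixel Plücker rays (d, o x d).

    Hypothetical helper: K is a 3x3 pinhole intrinsics matrix and c2w is a
    4x4 camera-to-world pose. Both conventions are assumptions of this sketch.
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)
    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ c2w[:3, :3].T
    d = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # Every ray of this view starts at the camera center.
    o = np.broadcast_to(c2w[:3, 3], d.shape)
    m = np.cross(o, d)                                    # moment vector o x d
    return np.concatenate([d, m], axis=-1)

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
c2w = np.eye(4)
c2w[:3, 3] = [1.0, 2.0, 3.0]
rays = plucker_raymap(K, c2w, height=64, width=64)
```

Stacking one such map per frame along the trajectory gives a dense, per-pixel description of the camera that an encoder could then compress into conditioning features.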
Abstract
Diffusion models have achieved impressive results in world modeling tasks, including novel view generation from sparse inputs. However, most existing diffusion-based novel view synthesis (NVS) methods generate target views jointly via an iterative denoising process, which makes it less straightforward to impose a strictly causal structure along a camera trajectory. In contrast, autoregressive (AR) models operate in a causal fashion, generating each token based on all previously generated tokens. In this work, we introduce ARSS, a novel framework that leverages a GPT-style decoder-only AR model to generate novel views from a single image, conditioned on a predefined camera trajectory. We employ an off-the-shelf video tokenizer to map continuous image sequences into discrete tokens and propose a camera encoder that converts camera trajectories into 3D positional guidance. Then, to enhance generation quality while preserving the autoregressive structure, we propose an autoregressive transformer module that randomly permutes the spatial order of tokens while maintaining their temporal order. Qualitative and quantitative experiments on public datasets demonstrate that our method achieves overall performance comparable to state-of-the-art view synthesis approaches based on diffusion models. Project page: https://wbteng9526.github.io/arss/.
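The token ordering described in the abstract (random spatial order, fixed temporal order) can be illustrated with a small NumPy sketch. It assumes the tokenizer yields a fixed number of tokens per frame and that an independent permutation is drawn for each frame. The function name and both assumptions are illustrative and may differ from the actual ARSS implementation.

```python
import numpy as np

def spatially_permuted_order(num_frames, tokens_per_frame, rng):
    """Return flat token indices with frames in temporal order and a
    random spatial order inside each frame (hypothetical helper)."""
    order = []
    for t in range(num_frames):
        # Shuffle spatial positions only; the frame index t is untouched.
        perm = rng.permutation(tokens_per_frame)
        # Offset into the block of flat indices that belongs to frame t.
        order.append(t * tokens_per_frame + perm)
    return np.concatenate(order)

rng = np.random.default_rng(0)
order = spatially_permuted_order(num_frames=3, tokens_per_frame=4, rng=rng)
frame_ids = order // 4  # frame index of each token in generation order
```

Because every token of frame t precedes every token of frame t+1, a causal attention mask over this sequence still respects the camera trajectory, while the model sees spatial positions in a different order on each draw.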
