Scaling Sequence-to-Sequence Generative Neural Rendering
Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C. Pérez, Zijian Zhou, Chi Phung, Tao Xiang, Juan-Manuel Pérez-Rúa
TL;DR
Kaleido reframes 3D rendering as a sequence-to-sequence problem by treating 3D as a specialised sub-domain of video and using a decoder-only rectified flow transformer pre-trained on large-scale video data. It introduces a unified space-time positional encoding, principled view sampling, windowed attention, and signal-to-noise ratio (SNR) sampling tailored for rendering, enabling arbitrary numbers of reference and target views with full 6-DoF control. The model demonstrates state-of-the-art zero-shot novel view synthesis across object- and scene-level benchmarks and, with many views, matches the quality of per-scene optimisation methods, while also delivering competitive 3D reconstruction results. These findings point to a scalable, data-driven rendering engine that unifies 3D and video modelling, reduces reliance on scarce camera-labelled 3D data, and enables flexible, high-fidelity multi-view generation. The work also identifies limitations in texture consistency, intrinsics handling, and speed, outlining directions toward faster, intrinsics-aware, and true 4D generative rendering.
Abstract
We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systematic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets, all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.
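
To make the sequence-to-sequence framing concrete, the following is a minimal sketch of how one rectified flow training pair could be built over a mixed sequence of reference and target views. It is not the authors' implementation. It uses the standard rectified flow formulation (linear interpolation between data and Gaussian noise, with the velocity as regression target), and it assumes that reference views are kept clean and excluded from the loss, which the abstract does not specify. The names `rectified_flow_pair`, `masked_mse`, and `is_target` are hypothetical, and each view is reduced to a short feature vector for illustration.

```python
import random


def rectified_flow_pair(clean, is_target, t, rng):
    """Build the noised input sequence and velocity targets for one step.

    clean:     list of per-view feature vectors (each a list of floats)
    is_target: list of bools, True where the view must be generated
    t:         noise level in [0, 1]; 0 = clean data, 1 = pure noise
    rng:       a random.Random instance used to draw Gaussian noise
    """
    noised, velocity = [], []
    for view, target in zip(clean, is_target):
        if target:
            noise = [rng.gauss(0.0, 1.0) for _ in view]
            # Straight-line interpolation between data and noise.
            noised.append([(1.0 - t) * x + t * n for x, n in zip(view, noise)])
            # Regression target: the constant velocity along that line.
            velocity.append([n - x for x, n in zip(view, noise)])
        else:
            # Assumption: reference views stay clean and carry no loss.
            noised.append(list(view))
            velocity.append(None)
    return noised, velocity


def masked_mse(pred, velocity):
    """Mean squared error computed over target views only."""
    total, count = 0.0, 0
    for p, v in zip(pred, velocity):
        if v is None:
            continue  # reference view: skipped
        total += sum((a - b) ** 2 for a, b in zip(p, v))
        count += len(v)
    return total / count if count else 0.0


# Toy usage: one reference view followed by two target views.
rng = random.Random(0)
views = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mask = [False, True, True]
noised, velocity = rectified_flow_pair(views, mask, t=0.5, rng=rng)
```

Because the reference/target split is just a per-view mask, the same routine covers any number of reference and target views. In a real model, a transformer would take `noised` together with camera poses and predict the velocity for each target view.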
