Scaling Sequence-to-Sequence Generative Neural Rendering

Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C. Pérez, Zijian Zhou, Chi Phung, Tao Xiang, Juan-Manuel Pérez-Rúa

TL;DR

Kaleido reframes 3D rendering as a sequence-to-sequence problem: it treats 3D as a specialised sub-domain of video and uses a decoder-only rectified flow transformer pre-trained on large-scale video data. It introduces a unified space-time positional encoding, principled view sampling, windowed attention, and SNR sampling tailored for rendering, which together support arbitrary numbers of reference and target views with full 6-DoF camera control. The model achieves state-of-the-art zero-shot novel view synthesis on object- and scene-level benchmarks and, given many views, matches the quality of per-scene optimisation methods, while also delivering competitive 3D reconstruction results. These findings point to a scalable, data-driven rendering engine that unifies 3D and video modelling, reduces reliance on scarce camera-labelled 3D data, and enables flexible, high-fidelity multi-view generation. The work also identifies limitations in texture consistency, handling of camera intrinsics, and speed, and outlines directions toward faster, intrinsics-aware, and true 4D generative rendering.

Abstract

We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systematic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets, all without any architectural modifications. Kaleido sets a new state of the art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.
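
As an illustration of the conditioning scheme described above, the following is a minimal sketch of how a multi-view sequence could be prepared for a rectified flow training step: reference views are kept clean, and target views are interpolated toward Gaussian noise. The linear interpolation x_t = (1 - t) x_0 + t ε and the velocity target ε - x_0 are the standard rectified flow formulation. The function name `noise_targets`, the flat-list latent layout, and the convention that t = 1 means pure noise are assumptions made for this example and are not taken from the paper.

```python
import random


def noise_targets(latents, is_reference, t, rng=None):
    """Prepare one multi-view sequence for a rectified flow training step.

    latents:      list of views, each a flat list of floats (hypothetical layout)
    is_reference: list of bools, True for views used as clean conditioning
    t:            one timestep in [0, 1] shared by all target views
                  (assumed convention: t = 1 is pure noise)
    Returns (model_inputs, velocity_targets); the velocity target is None
    for reference views because nothing is predicted for them.
    """
    rng = rng or random.Random()
    inputs, velocities = [], []
    for view, ref in zip(latents, is_reference):
        if ref:
            # Reference views enter the sequence unchanged.
            inputs.append(list(view))
            velocities.append(None)
            continue
        noise = [rng.gauss(0.0, 1.0) for _ in view]
        # Straight-line interpolation between data (t = 0) and noise (t = 1).
        inputs.append([(1.0 - t) * x + t * e for x, e in zip(view, noise)])
        # Regression target: the constant velocity along that straight line.
        velocities.append([e - x for x, e in zip(view, noise)])
    return inputs, velocities


# One reference view followed by two target views, all sharing t = 0.3.
views = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [0.25, 0.75, -2.0]]
mask = [True, False, False]
inputs, velocities = noise_targets(views, mask, t=0.3, rng=random.Random(0))
```

Because the path is a straight line, the clean latent of any target view can be recovered as x_t - t·v, which is what makes a single shared timestep per sequence straightforward to work with in this sketch.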

Paper Structure

This paper contains 38 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Kaleido is a generative rendering engine that can synthesise any number of photorealistic novel views across diverse artistic styles from any number of reference images (white boxes) with arbitrary 6-DoF camera poses.
  • Figure 2: Rendering as Sequence-to-Sequence Image Modelling. We propose that neural rendering can be framed as a sequence-to-sequence task, unifying its design with language and video generation. In this formulation, a transformer (Vaswani et al., 2017) learns to generate image tokens conditioned on their spatial positions, similar to how language models condition on token positions in a sequence, and video models condition on temporal positions across frames.
  • Figure 3: Kaleido Design Ablations. We extensively ablate various architectural designs and training strategies to explore effective scaling strategies for generative neural rendering. Each ablation experiment was conducted with Kaleido-Small, trained for 100K steps in total, on a mixture of Objaverse and uCO3D sampled randomly. We report PSNR and training throughput for each configuration and evaluate performance in two settings: 1 target view conditioned on 5 reference views, and 5 target views conditioned on 5 reference views. We broadly split our designs into four categories: the Kaleido architecture design spaces (i–v); scaling stability techniques to handle large activations (vi–vii); training and inference timestep sampling strategies (viii–ix); and the role of video pre-training (x). The arrow $(\to)$ indicates the progression from our initial baseline design to our final, optimised design choice.
  • Figure 4: Kaleido Architecture Design Details. Kaleido is designed with a simple and scalable decoder-only transformer. It processes a sequence of tokens with clean reference latents (concatenated with their DINOv2 features) and noised target latents. During training, a single timestep $t$ is sampled per scene and integrated into the network via AdaIN layers, similar to DiT (Peebles and Xie, 2023). The core of the model consists of repeating blocks of spatial self-attention (for within-frame interactions) followed by temporal window attention (for cross-frame interactions), and a SwiGLU feed-forward layer. Within each attention block, we encode a unified positional encoding design based on Geometric Transformation Attention (GTA) (Miyato et al., 2024), which consistently represents all 2D, 3D, and temporal positions. This enables the same architecture to be trained on both video and multi-view 3D data without architectural changes.
  • Figure 5: Visual Analysis of Massive Activations in a Rectified Flow Transformer. We provide an empirical analysis of massive activations emerging during training. (a) Visualisation of activation magnitudes across model layers at 100K training steps, showing they are sparse but grow suddenly at a middle layer. (b) The maximum activation magnitude (measured at the final layer) grows over training time and correlates positively with image resolution (and thus, number of tokens). (c) The same training configuration as (a), but with learnable register tokens applied, demonstrating a significant and consistent reduction in activation magnitudes.
  • ...and 5 more figures
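
The alternation of spatial self-attention and temporal window attention described in the Figure 4 caption can be pictured as two boolean attention masks over the flattened token sequence. The sketch below is only an illustration: the captions do not specify how the window is defined, so the symmetric window over frame index, the `radius` parameter, and the function names `spatial_mask` and `temporal_window_mask` are all assumptions made here.

```python
def frame_index(num_frames, tokens_per_frame):
    """Frame id of every token in a frame-major flattened sequence."""
    return [i // tokens_per_frame for i in range(num_frames * tokens_per_frame)]


def spatial_mask(num_frames, tokens_per_frame):
    """True where query and key tokens belong to the same frame."""
    frame = frame_index(num_frames, tokens_per_frame)
    n = len(frame)
    return [[frame[q] == frame[k] for k in range(n)] for q in range(n)]


def temporal_window_mask(num_frames, tokens_per_frame, radius):
    """True where the key's frame lies within `radius` frames of the query's.

    Assumed window definition (hypothetical): symmetric over frame index,
    with all tokens of an in-window frame visible to the query.
    """
    frame = frame_index(num_frames, tokens_per_frame)
    n = len(frame)
    return [[abs(frame[q] - frame[k]) <= radius for k in range(n)]
            for q in range(n)]


# Four frames of two tokens each; the temporal window spans one neighbour
# on either side.
spatial = spatial_mask(num_frames=4, tokens_per_frame=2)
temporal = temporal_window_mask(num_frames=4, tokens_per_frame=2, radius=1)
```

In this picture the spatial mask is block-diagonal, so its cost grows linearly with the number of frames, while the temporal mask is a band whose width is set by the window. That is one plausible reason a windowed design would scale to long view sequences, though the trade-offs actually measured are those reported in the paper's ablations.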